ERRATA

Chapter  4 

pages 4-43, 4-44, 4-4‘3; "hOCTRU" to read as "LOCATION"

page 4-56, after last line should be appended:

"2. Handling of the special constructs used to change
terminal characteristics and the system's responses
to the terminals."


Chapter  5 

page 5-12, under Physical; should read as "Size: 1.2" x
11.5" x 27.5"


page 5-38, under Storage Capacities; should read as "65,536
words/module"

page 5-53, under H, line 3 to read as "000000011"

page 5-54 bottom of page. The modified equation is

Port = 32 x (EMno MOD 512) + 2 x (DMno MOD 4) + 1
for 512 < EMno < 527

which allows better distribution of the spare
modules.

Chapter  7 

page  7-7,  paragraph  4,  line  7;  "CU"  should  read  as  "CR" 

page 7-18, paragraph 1, line 5 should read as "would
appreciably improve throughput."


Appendix  A 

page  A-3,  equation  A.l.  f^  and  are  the  number  of  floating 
point  operations  and  the  execution  time  of  the  ith 
piece , respectively. 

page  A-23,  third  bullet,  line  1;  "The  correct  algorithm"  should 
read  as  "The  given  algorithm" 

page  A-35,  paragraph  2,  line  2;  LAX  should  be  deleted. 

page A-40, paragraph 2, line 4; "NJ(J)" should read as "NM(J)"

page  A-59,  paragraph  2,  line  8 should  read  as  "the  physical 
problem  needs  to  be  retained" 

page  A-63,  second  equation  "T£;|^=144"  should  read  as  + 144" 



Appendix  B 

page B-32, "60%" column, "Double Omega, 512/512" row,
"0.504" should read as "0.0504".

page B-40, paragraph 2, line 2; "of that" should read as
"of requests that".

page  B-45,  paragraph  4,  line  5;  "0  i 10"  should  read  as 
"0  ^ i 10" 

Appendix  C 

page C-8, paragraph 1, line 9; "SOTREM" should read as "STOREM"
page C-24, under LOADEM, line 2; "TN" should read "CN".
page C-27, under FILLRE; "FILIR" should read "FILER".
page C-40, line 4; "CTIX±" to read as "CTIXl"


Appendix  D 

page D-10 paragraph 1, lines 6, 7, 8; "Hence, (1-Fl) is the
fraction of failures that cause a transition
directly into the INTERRUPT state," should be
deleted.

page D-12, under TIME BETWEEN FAILURES (PERMANENT), line 3;
"intermittant type device failure" to read as
"permanent type device failure".


Appendix  F 

page  F-39,  paragraph  5,  line  9,  "15,38K"  to  read  as  "15.38K". 


Appendix  H 

page H-15, equation H.3 to read as P(A-UPPER=1) = P(INPUT) x
P(0-BIT=1) x P(1-BIT=1)


INTRODUCTION


This report presents the results of Burroughs Corporation's
efforts on the Feasibility Study for the Numerical Aerodynamic
Simulation Facility (NASF). The study has demonstrated that a
particular form and architecture for the NASF (proposed originally
during the Preliminary Study [1, 2] and improved during the
present study) would meet the established objectives. The
Numerical Aerodynamic Simulation Facility is conceived to be more
than just a very high-speed computing machine. The facility must
also include all that is required to support the users of such a
high-speed capability. The feasibility study required
consideration of all parts of the proposed NASF system. The depth
of study of each part of the system varied depending on the
complexity of that part of the system, on the impact of that part
on the system capabilities and on whether or not there was
sufficient prior knowledge about how to implement that part of the
system.

The evaluations performed as part of the study focused on three
major issues. First, the ability of the proposed system
architecture to support the anticipated workload was evaluated.
Second, the throughput of the computational engine (the Flow Model
Processor) was studied using real application programs. Third,
the availability, reliability, and maintainability of the system
were modeled. The evaluations were based on the Baseline Systems
of the Preliminary Studies [1, 2] as modified where appropriate
during this study.

The results of these evaluations show that the implementation of
the NASF, in the form considered, would indeed be a feasible
project with an acceptable level of risk. The technology required
(both hardware and software) either already exists or, in the case
of a few parts, is expected to be announced this year.

This report describes many of the details of the system including
the hardware configuration, user language, software, fault
tolerance, and other aspects of the system on which this
demonstration of feasibility is based. The first chapter summarizes
the study objectives, the evaluations made and the results. The NASF
system architecture, which is the basis of discussion throughout the
report, is described in Chapter 2. The system-level loading
analysis performed as part of the study is summarized in Chapter 2
while Chapter 3 reports on the results of timing actual codes for
the configurations assumed. The NASF Software and Hardware
developments are detailed in Chapters 4 and 5. The various models
used to evaluate reliability, availability, maintainability,
trustworthiness and the results of that detailed evaluation are
included in Chapter 6. Chapter 7 describes the models which have
been used during Flow Model Processor (FMP) instruction
timing simulations. The report concludes with a chapter which
identifies some of the management and control techniques which
could be used to eventually manage a project of this scope. Even
more detail concerning most of the areas discussed in the report
is included in the Appendices. Each of the chapters includes an
introductory section which can be scanned to gain a general
perception of each part of the project after reading Chapter 1.


CHAPTER 1

STUDY  OBJECTIVES  AND  RESULTS 


1.1 STUDY OBJECTIVES

The principal objective of the study has been to consider the
feasibility that a facility (NASF), which could support a
throughput well in excess of what would be commercially available,
could be implemented. In particular, the goal is to have a system where
time-averaged  Navier-Stokes  computation  can  be  performed  in  10 
minutes  or  less  (on  steady  fluid  flow  problems  involving  a million 
grid  points).  Not  only  is  this  throughput  goal  important,  but 
since  the  intent  of  the  facility  is  to  support  daily  usage  by  a 
large user community, the NASF system availability needs to be
better than 90% and the facility needs to be nominally available
for 22 hours a day. In order that the NASF may support long runs,
the mean time between interruptions should be longer than ten
hours. In some cases, an alternate form of the throughput goal
can be used. A sustained, average rate of execution of one
billion floating point operations per second (one gigaflop/sec or
1 GFLOPS)  corresponds  roughly  to  the  problem  throughput  desired  on 
the  aerodynamic  flow  codes. 
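
As a rough check on how the two forms of the throughput goal relate, the arithmetic can be sketched as follows. The function and constant names are illustrative and do not come from the study; the only inputs are the figures quoted above (10 minutes, one million grid points, 1 GFLOPS).

```python
def operation_budget(rate_flops, run_seconds):
    """Total floating point operations executed at a sustained rate."""
    return rate_flops * run_seconds


def operations_per_grid_point(total_operations, grid_points):
    """Average number of operations available for each grid point."""
    return total_operations / grid_points


GOAL_RATE_FLOPS = 1.0e9        # one billion floating point operations per second
GOAL_RUN_SECONDS = 10 * 60     # the 10-minute run time goal
GRID_POINTS = 1_000_000        # problem size quoted in the text

budget = operation_budget(GOAL_RATE_FLOPS, GOAL_RUN_SECONDS)
per_point = operations_per_grid_point(budget, GRID_POINTS)
print(f"{budget:.1e} operations in total, {per_point:.0f} per grid point")
```

Under these assumptions, a million-point problem that needs up to roughly 600,000 floating point operations per grid point would still finish within the 10-minute goal at a sustained 1 GFLOPS.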

The  starting  point  of  the  effort  in  this  study  was  the  baseline 
configuration  developed  during  the  Preliminary  Study  under 
contract NAS2-9456 [1, 2]. The overall goal was to gain an
understanding of the characteristics, capabilities, and potential of
the  facility  in  order  to  make  a judgment  as  to  its  feasibility. 
The  study  required  the  development  of  further  specifications  in 
order  to  consider  the  responsiveness  to  the  desired  application  of 
the  facility  and  to  develop  estimates  of  the  schedule,  cost,  and 
risk  of  such  a development. 

Both  functional  and  performance  (timing)  simulators  were  developed 
to  be  able  to  estimate  (as  accurately  as  possible)  performance  and 
reliability  of  the  system.  Although  the  primary  application  of 
the facility is likely to be aerodynamic flow modeling, the
performance studies included both aerodynamic flow codes and weather
modeling codes. The use of real programs in these application
areas allowed an initial evaluation of the flexibility of the
language constructs proposed. This evaluation was especially
important  since  the  facility  needs  to  be  sufficiently  flexible 
that  algorithm  development  could  be  supported  for  fluid  dynamics 
algorithms  as  yet  not  investigated.  In  addition,  the  diverse  user 
needs  for  input,  output,  and  algorithm  investigation  must  be 
supported. 

Since the development of the baseline systems considered only
aerodynamic flow modeling applications, the consideration of weather
modeling codes was especially important. This consideration was
used to evaluate the flexibility of the system as far as its
support of other, related application areas and was used to
determine whether further improvements might be needed to support
these additional applications.

All of the goals could be met by the system described as a
possible NASF configuration. No hardware modifications would be
needed for weather code optimization. Some minor software
extensions were proposed based on the weather code evaluations.

1.2  SYSTEM  DESCRIPTION 

Before  describing  the  system  evaluated  during  this  study,  the 
importance  of  considering  all  aspects  of  the  facility  must  be 
emphasized.  During  the  development  of  the  system,  the  focus  tends 
to  be  on  the  hardware  and  system  software  (such  as  operating 
systems  and  compilers).  As  shown  in  Figure  1.1,  such  a focus  is 
limited.  If  only  the  system  expense  is  considered,  the  other 
areas  important  to  the  successful  utilization  of  the  facility  may 
be slighted. In particular, users themselves face both the
expense of their training in the use of the system and the day to
day expense of developing and using their various application
programs. This usage would include algorithm development,
programming, model description, data reduction, and so on. The users
must  be  supported  by  a staff  and  whatever  other  support  might  be 
needed  to  keep  the  facility  operational.  Such  support  might 
include  operators,  power,  cooling,  training  and  supplies. 

Although  the  consideration  of  all  these  factors  complicates  the 
development  of  the  facility,  these  factors  must  be  carefully 
considered  in  order  to  have  a facility  that  would  not  only  be 
economical  to  acquire  but  also  be  economical  to  use.  The  system 
described  below  did  consider  these  factors. 


Figure 1.1 Total Cost of NASF Usage

1.2.1 Hardware

The system originally defined during the Preliminary Studies and
modified during this study is shown conceptually in Figure 1.2.
The Flow Model Processor (FMP), which provides the required
computational power, is a dedicated computing engine with an
architecture based on the special needs of modeling. The Support
Processor, the Peripheral Support System, and the File System
together constitute the Support Processing System. The Support
Processing System interfaces with the users, maintains the data
files, and controls the flow of jobs and data to and from the FMP.
Not shown in the figure are the support elements including
building, power, office space and cooling.

Figure 1.2 NASF Organization

The architecture of the Flow Model Processor is based on the needs
of discrete modeling and simulation. The FMP, which is described
in more detail later, has 512 processors that normally would
execute independent of, and concurrent with, each other. A
coordinator is used to allow the processors to execute in
synchronism. The processors each have memory space for programs and
data. In addition, a large memory (called the Extended Memory) can
be accessed by all processors through a high-speed network called
the Connection Network. The Extended Memory normally would contain
the data common to the processes being independently evaluated by
each of the processors. Finally, a slower staging memory (called
Data Base Memory) would be provided to hold the next job, the last
job and the current job. The Data Base Memory buffers programs
and data in order to provide a smooth flow of tasks to and from
the FMP. The memory sizes assumed during the study were based on
the aerodynamic flow codes that are expected to be the primary
application on the FMP.
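
The idea of processors that run independently but are brought into step by a coordinator can be illustrated, very loosely, with threads and a barrier. This is only an analogy under stated assumptions: the thread count is reduced from 512 to 8, a Python list stands in for the common memory, `threading.Barrier` plays the coordinator's role, and the name `run_lockstep` is invented for this sketch. It does not describe how the FMP hardware is built.

```python
import threading


def run_lockstep(num_processors=8, num_steps=3):
    """Run independent workers that meet at a barrier after every step."""
    barrier = threading.Barrier(num_processors)  # stand-in for the coordinator
    shared = [0] * num_processors                # stand-in for common memory
    observed = [[None] * num_processors for _ in range(num_steps)]

    def processor(pid):
        local = pid                              # private, per-processor state
        for step in range(num_steps):
            local += 1                           # independent work on local data
            shared[pid] = local                  # publish this processor's result
            barrier.wait()                       # wait until all have published
            observed[step][pid] = sum(shared)    # all read the same common data
            barrier.wait()                       # wait until all have read

    threads = [threading.Thread(target=processor, args=(pid,))
               for pid in range(num_processors)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return observed


observed = run_lockstep()
for step, totals in enumerate(observed):
    print(step, totals)
```

Between barriers each worker touches only its own data; the two waits per step ensure that every worker sees the same snapshot of the shared list before anyone overwrites it.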

The Support Processing System would consist of three portions: the
Support Processor, the File System, and the Peripheral Support
System. The Support Processor (the host processor) would run the
main portion of the operating system (called the Master Control
Program). A dual-processor B7800 was assumed for evaluation
purposes. Most of the user interaction with the NASF would be
through the Support Processor. The File System includes disk
packs, an archival store, and the manager of the files. Data
paths to and from the files would exist for the FMP, for the
Support Processor, and for user support. The third element
considered as part of the Support Processing System is the
Peripheral Support System. The Peripheral Support System has been
included because the evaluations performed in the study
demonstrated that at least one of the supportive tasks involved such
a level of work that a special processor for that task should be
considered. In particular, the evaluations demonstrated that an
exceptionally heavy load can be expected to support Computer Output
to Microfilm (COM). This load may be in excess of 10,000 frames of
graphic information per day. The Peripheral Support System would
include facilities specially designed to support such exceptional
loads in order to improve the load balance across the entire
facility.

1.2.2  Software 

Not shown in Figure 1.2 is the software which would be used to
support users and to control the efficient usage of the resources
within the facility. A dialect of FORTRAN, called FMP FORTRAN,
has been proposed which has a few simple extensions to standard
FORTRAN. These extensions provide application-oriented approaches
to using the independent, concurrent mode of operation. In
addition, statements are included which are capable of using a
large number of processors at once on a single computation. Since
the Support Processor would be a commercially available processor,
standard languages such as ALGOL, FORTRAN, and COBOL would be used
for process definition on that processor. The File System would
not be programmed by the users, but would provide high-level file
management and access capabilities.

The  NASF  operating  system  (called  the  Master  Control  Program,  or 
MCP)  would  reside,  in  part,  on  all  elements  of  the  system.  Since 
the  Master  Control  Program  (MCP)  would  be  based  on  existing 
software,  the  major  portion  would  reside  on  the  Support  Processor. 
The  portion  of  the  MCP  on  the  FMP  would  manage  the  flow  of  jobs 
within the FMP and would be the primary focus of confidence and
diagnostic procedures within the FMP.

Since the FMP will have between 200,000 and 250,000 integrated
circuits, plus other components, both hard failures and transient
failures can be expected. Means for preserving the integrity of
the computation in the face of such failures must be provided.
The level of Large Scale Integration to be used is expected to
bring forth failure modes that have not been important in the
past, such as background radiation which may cause transient
errors in Data Base Memory. Defense against all these
possibilities must be included, and has been included in the
architecture described in this report. Where economically feasible,
mechanisms for error correction have been included, such as use of
single error correction, double error detection (SECDED) codes in all
memories. To reduce the probability of double errors in those
memories where transient failures may be expected, mechanisms to
"scrub" the memory by rewriting data back into memory with the
errors corrected are provided. For the various types of faults
which can be detected but are not easily corrected, on-line spare
processors and memory modules can be automatically switched in
under control of the MCP to replace failed elements.
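
The combination of SECDED coding and scrubbing described above can be illustrated with a toy example. The sketch below uses an extended Hamming code over 4 data bits, chosen only for brevity; the report does not state the word width or the particular code planned for the FMP memories, and every function name here is illustrative.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # bit positions holding data
PARITY_POSITIONS = (1, 2, 4)     # Hamming parity bits; position 0 is overall parity


def encode(data_bits):
    """Encode 4 data bits into an 8-bit SECDED word."""
    word = [0] * 8
    for position, bit in zip(DATA_POSITIONS, data_bits):
        word[position] = bit
    for p in PARITY_POSITIONS:
        parity = 0
        for i in range(1, 8):
            if i & p and i != p:
                parity ^= word[i]
        word[p] = parity
    overall = 0
    for i in range(1, 8):
        overall ^= word[i]
    word[0] = overall
    return word


def decode(word):
    """Return (status, word) with status 'ok', 'corrected' or 'double'."""
    syndrome = 0
    overall = 0
    for i, bit in enumerate(word):
        if bit:
            syndrome ^= i        # position 0 contributes nothing to the syndrome
            overall ^= 1
    if overall == 0:
        # Even overall parity: either no error or an uncorrectable double error.
        return ("ok" if syndrome == 0 else "double"), list(word)
    fixed = list(word)
    fixed[syndrome] ^= 1         # odd parity: single error at position `syndrome`
    return "corrected", fixed


def scrub(memory):
    """Rewrite correctable words in place; return addresses that could not be fixed."""
    uncorrectable = []
    for address, word in enumerate(memory):
        status, fixed = decode(word)
        if status == "corrected":
            memory[address] = fixed
        elif status == "double":
            uncorrectable.append(address)
    return uncorrectable


def data_of(word):
    """Extract the 4 data bits from a code word."""
    return [word[p] for p in DATA_POSITIONS]


original = [1, 0, 1, 1]
scrubbed_memory = [encode(original)]
unscrubbed_memory = [encode(original)]

for memory in (scrubbed_memory, unscrubbed_memory):
    memory[0][5] ^= 1            # first transient error
scrub(scrubbed_memory)           # only one memory is scrubbed in between
for memory in (scrubbed_memory, unscrubbed_memory):
    memory[0][2] ^= 1            # second transient error in the same word

status_scrubbed, word_scrubbed = decode(scrubbed_memory[0])
status_unscrubbed, _ = decode(unscrubbed_memory[0])
print(status_scrubbed, data_of(word_scrubbed), status_unscrubbed)
```

Without the scrub pass the two transient errors accumulate in the same word and can only be detected; with it, each error is corrected separately. Three or more errors in one word are outside what a code of this kind guarantees.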

The FMP was not the only element considered when developing the
necessary fault tolerant aspects of the system. The CPU in the B7800
Support Processor is duplexed, for example, as are the Data
Communications and Input Output Processors. A distributed control
scheme and a multiplicity of disk packs within the File System
serve to keep the system available for useful work without having
each and every one of them available at any given instant. The
automatic recovery procedures in the software not only support the
FMP as mentioned earlier, but exist as a standard part of the MCP
in the Support Processor.

1.3  NASF  EVALUATION 

Evaluation of the NASF considered many aspects. Three specific
issues received the major attention in terms of analysis
performed. These issues were an evaluation of system-level
capabilities to support the general work load of the facility, an
evaluation of the throughput of the FMP using real programs, and an
analysis of the availability, reliability, and maintainability of
the system. The general approach used for the evaluation and the
results observed are described below for each of these three areas.
As a result of these evaluations and the other work to date, those
areas which contribute to the risks of the program were
identified. These areas, which relate to the assurance of success
of the program, are explained below.

1.3.1 System Utilization Studies


The evaluation of the NASF system organization showed the
feasibility of the system to support the expected workloads. This
evaluation was based on a hypothetical, but well thought out, workload
supplied by NASA [4]. System-level models were developed and used
as the basis of the implementation of system analyzer programs.
The models were operationally based so that they may be easily
verified by direct observation of an actual system as development
might progress.

The  system-level  evaluation  included  consideration  of  the 
following : 

FMP Loading

Support Processor CPU Loading

Average Data Transfer Rates between Files, Users, FMP and
Support Processor

Expected  number  of  file  management  actions  such  as  file 
creation,  deletion,  and  accessing. 

The results of the evaluation show that the dual-processor B7800
assumed could comfortably handle the expected load with the
exception of the COM support activities discussed earlier. More
significantly, if projection is made to equivalent processors which
are likely to be available before the implementation of the facility,
such processors could handle a significant amount of the COM
support load. The average data transfer rates projected by the
analysis are well below the channel capacities planned. Although more
analysis of peak rate requirements has yet to be performed, the
projections to date are consistent with the expected results.

1.3.2 Flow Model Processor Throughput Evaluation

Throughput of the FMP was evaluated by measuring, in simulation
and by analysis, its performance on complete programs supplied by
NASA. The use of entire programs for measuring performance avoids
a common pitfall in predicting the performance of new and advanced
computers, namely the reliance on throughput evaluations which
look only at the "hard" parts of the problems, which also are by
no coincidence the parts of the problem that the advanced computer
is designed to work best on.

The results of the analysis of the two aerodynamic flow codes
(referred to as aero flow codes) show that the goals for
throughput for aero flow applications are met. One aero flow
code, identified as the "3D implicit" code, was projected to
execute in less than five minutes at a throughput rate of 1.01
billion floating point operations per second. The second aero
flow code, identified as the "3D explicit" code, was projected to
execute in less than seven minutes at a throughput rate of 0.89
billion floating point operations per second. Both codes were
evaluated at the nominal size expected to run on the FMP,
specifically one million grid points.

The results of the analysis of the weather codes show that the
FMP, as evaluated, is optimized for the weather codes as well.
NASA supplied two weather (or climate) codes. The first was a
version of the Mintz-Arakawa algorithm, as developed by the
Goddard Institute for Space Studies ("GISS"); the second was a
spectral weather code. The same detailed analysis was applied to
the GISS weather code that had been applied to the aerodynamic codes.
Fourteen days of simulated weather, with 20 minute time steps, in
a 2.5° (latitude and longitude) model with a total of 115,334 grid
points, would take 8 minutes to run on the FMP with an effective
throughput rate of 0.53 billion floating point operations per
second. Scrutiny of the second weather code showed that it could
be expected to run with slightly higher throughput than the GISS
weather code, but the detailed analysis was not made.
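
To put the GISS figures in perspective, the quoted run can be broken down into time steps and an approximate amount of work per grid point. The sketch below uses only the numbers stated above; because the 8-minute run time is a rounded figure, the derived values are approximations, and the names are illustrative.

```python
SIMULATED_DAYS = 14
STEP_MINUTES = 20
GRID_POINTS = 115_334
RUN_SECONDS = 8 * 60           # quoted run time on the FMP
RATE_FLOPS = 0.53e9            # quoted effective throughput


def time_steps(days, step_minutes):
    """Number of model time steps in the simulated period."""
    return days * 24 * 60 // step_minutes


steps = time_steps(SIMULATED_DAYS, STEP_MINUTES)
total_operations = RUN_SECONDS * RATE_FLOPS
operations_per_point_step = total_operations / (steps * GRID_POINTS)

print(f"{steps} time steps")
print(f"about {total_operations:.2e} floating point operations in total")
print(f"about {operations_per_point_step:.0f} operations per grid point per step")
```

On these figures the run covers 1,008 time steps and works out to a little over two thousand floating point operations per grid point per step.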

The analysis was very thorough. All programs evaluated were
dissected into code segments, each of which was internally
homogeneous. The throughput was estimated for each individual code
segment. From an analysis of how often each code segment was
executed, the individual throughput estimates were combined into
an overall execution time and throughput rate.
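
The way per-segment estimates combine into an overall figure can be sketched as below. This is a plausible reading of the procedure described above, not the study's actual tooling: the `CodeSegment` type, the `combine` function, and all of the numbers in the example are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CodeSegment:
    name: str
    flops_per_execution: float     # floating point operations in one pass
    seconds_per_execution: float   # estimated time of one pass
    executions: int                # how often the segment is executed


def combine(segments):
    """Return (total seconds, overall floating point operations per second)."""
    total_flops = sum(s.flops_per_execution * s.executions for s in segments)
    total_seconds = sum(s.seconds_per_execution * s.executions for s in segments)
    return total_seconds, total_flops / total_seconds


# Hypothetical segments: one fast inner kernel and one slow bookkeeping section.
segments = [
    CodeSegment("kernel", 1.5e6, 1.0e-3, 1000),       # 1.5 GFLOPS on its own
    CodeSegment("bookkeeping", 1.0e4, 1.0e-4, 5000),  # 0.1 GFLOPS on its own
]

total_seconds, overall_rate = combine(segments)
print(f"{total_seconds:.2f} s overall at {overall_rate / 1e9:.2f} GFLOPS")
```

The overall rate is total operations divided by total time, so slow segments pull it down in proportion to the time they consume. In this made-up case the bookkeeping section performs about 3% of the operations but takes a third of the run time.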

As  a verification  of  the  hand  analysis,  sections  of  code  were 
input  to  an  instruction  timing  simulator.  The  code  sections 
chosen  for  simulation  verified  throughput  rates  ranging  from  less 
than 0.1 GFLOPS to more than 1.5 GFLOPS. The instruction timing
simulator was based on a reasonably detailed model of a processor
in the FMP. The instruction times assumed in the model correspond
to what could be expected using good engineering practices and a
modern circuit family such as the Fairchild 100K family of ECL
circuits.  The  times  assumed  in  the  model  for  access  to  the  common 
memory  via  the  Connection  Network  were  based  on  detailed  analysis 
of  the  Connection  Network  itself.  A CN  simulator  was  developed 
and  used  to  analyze  various  access  patterns  including  some  taken 
from  the  aero  flow  codes.  A stochastic  analyzer  was  used  to 
determine  the  probability  of  success  in  making  connections.  The 
stochastic  analyzer  used  probability  equations  for  analysis.  Both 
methods  validate  a transfer  rate  through  the  connection  network  of 
over  one  billion  words  per  second  from  all  processors  to  all 
memory  modules. 

The  analysis  of  the  various  programs  required  preparation  of  FMP 
FORTRAN  versions  to  be  used  in  the  analysis  and  as  the  starting 
point  for  hand-compilation  onto  the  instruction  timing  simulator. 

The  conversion  from  the  FORTRAN  code  supplied  to  FMP  FORTRAN  was 
generally  straightforward.  In  some  cases,  significant  reductions 
in the length of the code could be made because of the
application-orientation of FMP FORTRAN.

1.3.3 Availability, Reliability and Maintainability Evaluations

Several methods were used to evaluate the availability,
reliability, and maintainability of the NASF. The predictions for the
FMP are based upon a computer model of reliability and availability
with assumptions that are derived from the military standard
methods for estimating reliability. In an attempt to be as
realistic as possible, field data which included failures due to
system software as well as hardware was used. In addition,
intermittent failure modes were modeled, where the rate of
intermittents was based on field experience.


With  the  fault  tolerance  mechanisms  in  place,  the  availability 
forecasts  are  99%  for  the  FMP  by  itself  and  over  99%  for  the 
Support  Processing  System.  These  individual  predictions  combine 
to  an  NASF  availability  of  over  98%.  An  estimate  of  14.1  hours 
between  interruptions  of  processing  was  also  made  as  a result  of 
the  reliability  and  availability  modeling.  These  predictions  for 
the  SPS  are  based  on  field  data  for  the  B7700,  which  is  similar  to 
the  B7800  for  reliability  and  availability. 
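
The way the two subsystem forecasts combine can be checked with a simple series model. The sketch assumes that the facility is usable only when both the FMP and the Support Processing System are up, and that their outages are independent; the report's own availability model may be more detailed, and the names below are illustrative.

```python
def series_availability(*availabilities):
    """Availability of a system that needs every listed subsystem to be up."""
    combined = 1.0
    for availability in availabilities:
        combined *= availability
    return combined


FMP_AVAILABILITY = 0.99        # forecast for the FMP by itself
SPS_AVAILABILITY_FLOOR = 0.99  # "over 99%", so 0.99 is a lower bound
GOAL = 0.90                    # stated availability objective

nasf_floor = series_availability(FMP_AVAILABILITY, SPS_AVAILABILITY_FLOOR)
print(f"NASF availability is at least {nasf_floor:.4f}")
print("meets goal" if nasf_floor > GOAL else "misses goal")
```

Since the Support Processing System figure is a lower bound, the product is a lower bound as well, which is consistent with the "over 98%" forecast and sits well above the 90% objective.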

1.3.4  Program  Success  Assurance 

To  assure  the  success  of  the  NASF  project,  one  must  assure  success 
in all areas. Some areas, being dependent mainly on existing
technology or existing methods, were only briefly addressed during
the study. Other areas of concern, especially where the NASF and
its FMP represent a break with past experience, were addressed at
greater  length.  A discussion  of  some  of  the  key  points  addressed 
is  summarized  below. 

Although  outside  the  scope  of  the  study,  the  need  for  continuing 
commitment to the successful implementation of the NASF on both
NASA's and the vendor's parts must be carefully considered. The
close technical interaction that was so important to the
Preliminary and Feasibility Studies must be continued. The length of time
from  the  eventual  start  of  design  to  delivery  of  the  system  is 
long.  Project  attention  must  be  kept  firmly  on  the  job  at  hand. 
Continual  changes  of  direction,  dilution  of  effort,  and  expansion 
of  goals  could  make  the  project  seem  to  have  a constant  time-to- 
completion.  This  study  has  shown  that  a project  begun  now,  with 
currently  available  or  imminently  expected  technology,  could 
deliver an operational system which would fulfill NASA's
objectives.

Software  development  could  have  several  potential  problem  areas. 
Software  has  been  notoriously  hard  to  schedule,  often  because  of 
incomplete  or  changing  specifications.  Software  is  especially 
subject to the temptation to add "just one more little feature",
making the resulting product more and more complex and difficult
to test. This problem must be handled by careful management. The
two major areas of software concern in the NASF are the operating
system, and the language and compiler. The operating system
(called the Master Control Program, MCP) would be based on the
existing MCP of the B7800 planned as the Support Processor. This
MCP has a history of 19 years of development behind it and is
already being modified by Burroughs to support job flow to the
computational engine for the Burroughs Scientific Processor. With
this work substantially complete, the integration of the FMP
becomes a task with much less risk.

Compiler development is another area often assumed to be a problem
area. Here risk has been significantly reduced by proposing a
language which is essentially ANSI Standard FORTRAN with a
structure surrounding the FORTRAN pieces. This structure allows
the FORTRAN pieces to map directly onto the many individual
processors of the FMP. The result is that most of the compilation
is the same serial FORTRAN to processor-level code process that
industry and Burroughs have considerable experience with. The
coordination between the pieces of standard FORTRAN is simply
described by the added structure and maps easily onto the section
of the FMP specifically designed for such coordination (i.e., the
coordinator).

As  a result  of  the  approaches  proposed  and  evaluated  during  the 
study,  the  success  of  implementation  of  the  necessary  software 
seems assured.

Hardware presents no threat to the success of the project. The
technology projections made during the Preliminary Study [2, 3]
are proving to be conservative. Logic design would be
straightforward and presents little in the way of new challenges. The
organization considered is very modular, which would allow
implementation of the system with only a few types of modules.

The  one  area  in  the  hardware  which  represents  a feature  not  found 
so far in any commercial computer is the Connection Network. This
network provides the necessary data paths between the many
processors and the large, common memory in the FMP. This network
has been thoroughly simulated and otherwise analyzed during the
course of this study.

1.4  CONCLUSION 

The work summarized above has demonstrated the feasibility of the
Numerical Aerodynamic Simulation Facility. Although some risks
have been identified, the level of risk is low for the architecture
and software considered during the evaluation. This system
is believed to be the best approach to meeting the total system
goals for the NASF. In particular, with these concepts no new
advances, beyond the technology available today, are needed in
order to successfully implement the facility.

CHAPTER 2


NASF  SYSTEM  ARCHITECTURE 


As indicated in Chapter 1, the feasibility study of the NASF
required broad consideration of the total needs of the proposed
facility and of the expected user community. Because of time and
budget constraints, detailed study was based on commercially
available equipment wherever possible. The system architecture used
for evaluation is substantially the same as that described during
the Preliminary Study [1, 2]. However, some changes were
indicated, based on this feasibility study. The modeling which was
done in support of this study was operationally based. That is,
the system-level models are designed so that they may be easily
verified by direct observation of an actual system. This approach
was chosen to make future verification of the models
straightforward.

2.1 OPERATIONAL ENVIRONMENT

Before considering the system architecture in detail, it is
important to first consider how the facility is expected to support the
user community. The planned operational environment of the NASF
has been reviewed in two documents provided by NASA [3, 4]. The
central computational facility (which includes the Flow Model
Processor and a Support Processing System) will be accessed by a
number of users at sites remote from the facility. Some of the
"remote" sites would be physically nearby (such as the NASA Ames
facility) while others would be at distant locations.

For the purposes of the study, some assumptions were made
concerning the users. The operational environment described by NASA
shows that many of the users will be directly concerned with
production use of the facility for design work. These production
users have been assumed, for purposes of the study, to be working
in design teams at "design centers". These design centers have
been assumed to have sophisticated graphics, processing, file
storage, and communications capabilities. These design centers
would reduce the processing load of the facility.

The NASA documents also pointed out that other users will be
involved with code development, method development, and research
in fluid physics and other areas. Some of these users have been
assumed to be associated with the design centers, at least as far
as use of facilities is concerned. Other users would have direct
access to the computational facility from their terminals.

Figure 2.1 depicts the assumed operational environment with the
central computational facility of the NASF at the top and with
users having access to that facility either via terminals or via
design centers. Figure 2.2 depicts the organization of a design
center. All sophisticated graphics equipment was assumed to be
associated with design centers. The processors which are part of
each design center were assumed to provide support to the users
both in terms of graphics I/O operations and in terms of text and
file handling. If the "nearby" design centers are assumed to
support fourteen active users and if the "remote" centers are
assumed to support four active users, the configuration shown in
Figure 2.1 would have at least 100 active users.

The design centers have not been studied further and are certainly
not a required part of the overall system. The main reason for
their consideration was to develop a realistic estimate of the
amount of load on the Support Processing System for text input and
editing tasks. Based on the environment just described, the
fraction of users who require the Support Processor for data entry
and editing was assumed to be 0.2. The other 80% of the users
either use the facilities of a design center, or have terminals
with built-in edit mode capabilities.

2.2  SYSTEM  DESCRIPTION 

The NASF consists of three elements: the Flow Model Processor
(FMP), the Support Processing System (SPS) and the physical
environment including the building, power, cooling, etc.

2.2.1  FMP 

The Flow Model Processor (FMP) is a dedicated, single-user-at-a-time
computing engine which has no I/O capabilities except through
a staging memory. The FMP is based on a large number of
independent processors, each executing FORTRAN code independently of the
others. The extensions to FORTRAN described in Chapter 4 include
constructs which allow description of significant amounts of
independent, concurrent operations. In addition, provision was made
(in both the hardware and software) to allow a single
computation to utilize a large number of processors. The FMP also
includes a very wide bandwidth memory that can be shared by all the
processors. The memory sizes assumed for the study were based on
the aero flow programs used for evaluation during the study. More
details are included in Chapter 5.

2.2.2 Support Processing System

The Support Processing System serves as the central control,
interfaces with users and peripherals, maintains the data files and
provides that computational support necessary to keep the FMP
effectively utilized. The Support Processing System consists of three
portions: a host processor called the Support Processor, a File
System, and a Peripheral Support System. Most of the discussion
throughout the rest of this report refers only to the Support
Processor (which is the host) and to the File System.


Figure 2.1 NASF Operational Environment


2.2.2.1 Support Processor


The host processor, which is identified as the Support Processor
throughout this report, was assumed to be a dual-processor B7800 for
the purposes of the study. This processor was chosen for two
major reasons. First, the B7800 system is a new, standard product
which has evolved from the Burroughs 700 and 800 series machines
over the past 16 years. A wide range of data communications and
peripheral support is available on this system. Second, because
the B7800 is an evolutionary system, it supports the Master
Control Program (operating system), compilers, utilities, and
application programs developed by Burroughs for the B6000 and
B7000 series processors. The feasibility of this system for
control of the FMP seems clear since the same functions are already
being implemented for the Burroughs Scientific Processor (BSP)
which also attaches to the B7800 system.

The B7800 employs independent functional processing to distribute
both intelligence and control among various processing elements.
The B7800 includes five independent functional processors: the
central processor, the input/output processor, a
memory control processor, a communications processor, and a
maintenance and diagnostic processor. The configuration assumed for
the study includes redundancy in essentially all elements of the
system, resulting in very high availability.

Since the Support Processor would be the master control for the
facility, most user communications would be supported with the
Data Communications Processor portion of the B7800. The
configuration assumed for the study included 96 input lines, of which four
were synchronous broadband lines (19.2 Kbps to 1,344 Kbps). The
remainder were assumed to be a combination of synchronous and
asynchronous lines of various rates (1.2 Kbps to 9.6 Kbps).

In  addition  to  the  standard  line  control  disciplines,  the  B7800 
and  its  Data  Communications  Processor  would  provide  the  capability 
for  network  access  and  control.  This  capability  would  provide  the 
needed  flexibility  for  potential  users  to  be  connected  to  the 
facility  "on  their  terms". 

2.2.2.2 Peripheral Support System

In addition to the input/output processors on the B7800 Support
Processor, the study demonstrated that some peripherals would
require a significant amount of computational support. The most
significant of these devices is the Computer Output to Microfilm
(COM) device. The NASA-supplied scenario of usage [4] postulated
a very heavy COM load (in excess of 10,000 frames per day) where
the output was assumed to be graphic images. The majority of this
load was for "movies" of complex evaluation results.


The system utilization analysis (summarized in Section 2.3 below)
clearly demonstrated the impact of the COM formatting, even when
the formatting was only to produce listings of the points of
interest rather than graphics control procedures. This load could be
supported with additional central processors within the Support
Processor. Alternatively, the load could be supported by doing
the necessary formatting in the FMP prior to FMP task completion.
A third alternative was to consider a separate system, specifically
oriented to supporting this formatting and the COM device.

No study has been completed with regard to what impact the COM
formatting would have on FMP loading. By studying the Support
Processor loading with and without the COM formatting task, it was
clear that one feasible way to support a load of this sort was
with a specialized system specifically planned for that support.
Although further study should be performed, a Peripheral Support
System configured with two high-end minicomputers with
special-purpose software should be capable of handling the required tasks.

2.2.2.3 File System

A separately managed and accessed file system is required as part
of the Support Processing System. The volume of data and programs
which will be moved in and out of the Flow Model Processor,
together with the amount of file management required for the total
system, indicates strongly that a separate system should be provided
for this purpose (rather than using the Support Processor itself,
for example). Secondly, when file management functions are in a
processor different from any processors which may be executing
user programs, the confidence in security capabilities can
increase significantly. The File System includes the disk packs,
the archival store, and the file manager. Conceptually, the File
System also includes the Data Base Memory, the staging area for
programs and data within the FMP.

The File System is another part of the facility where a detailed
study has not been completed. Enough is known about the
requirements of the File System to be confident that such a system can be
configured from essentially standard components. These
requirements will be summarized below.

The File System should be organized such that many simultaneous
high-speed transfers are possible. The NASF architecture requires
four major connections to the File System: the FMP, the Support
Processor, the Peripheral Support System, and the Users. In
addition, the File System would be capable of responding to
requests for data movement within the File System and would provide
automatic management of the space.

The FMP requires up to four simultaneous paths (12.5 Mbits/sec
each) to and from the File System, although the use of these paths
is not continuous. The peak requirement of each of the two
Input-Output Processors of the B7800 Support Processor is also 50
Mbits/sec, which, like the FMP connection, is primarily disk I/O.
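
As a rough check on these figures, the peak File System bandwidth can be tallied per connection. The sketch below assumes that the four FMP paths and the two Input-Output Processors could all peak at the same time and that their rates simply add; neither assumption is stated in the study, and the names are illustrative only.

```python
# Peak File System bandwidth implied by the figures in the text.
# Assumption (not stated in the study): all paths can peak simultaneously
# and their rates add linearly.

FMP_PATHS = 4             # simultaneous FMP paths
FMP_PATH_MBITS = 12.5     # Mbits/sec per FMP path
IOP_COUNT = 2             # Input-Output Processors on the B7800
IOP_PEAK_MBITS = 50.0     # Mbits/sec peak per Input-Output Processor


def peak_bandwidth_mbits():
    """Return (FMP total, Support Processor total, combined) in Mbits/sec."""
    fmp_total = FMP_PATHS * FMP_PATH_MBITS
    sp_total = IOP_COUNT * IOP_PEAK_MBITS
    return fmp_total, sp_total, fmp_total + sp_total


fmp, sp, combined = peak_bandwidth_mbits()
print(f"FMP: {fmp} Mbits/sec, Support Processor: {sp} Mbits/sec, "
      f"combined: {combined} Mbits/sec")
```

Under these assumptions the four FMP paths together come to 50 Mbits/sec, the same figure quoted for each Input-Output Processor.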


The Peripheral Support System requirements are insignificant by
comparison. The interface from Users to the File System would be
for the purpose of accessing graphics data files without having to
funnel such requests through the B7800 Support Processor. Again,
since the User loading would be on the order of the Peripheral
Support System loading, the User loading would be supported well
if the FMP and Support Processor are supported. The technology
for connections of this sort has already been reported in the
literature [15] and is in production. Equipment to support 50
Mbits/sec per channel is available now. The major thrust of
development, at least at Burroughs, is to significantly reduce the
cost of this technology or an equivalent.

The file manager would be expected to handle approximately 10,000
file creations and deletions per day. The File System would
respond to approximately 25,000 requests for file accesses per
day. All interfaces to the file system would be in terms of file
"names" rather than physical media position. The File System
would perform dictionary management and storage allocation
functions. Also, the File System would be responsible for data
ownership and access controls.
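
To put the daily counts in perspective, they can be converted to average intervals between requests. This is only a sketch: it assumes requests are spread evenly over a 24-hour day, which the study does not claim, so peak rates would be higher than these averages suggest.

```python
# Average request spacing implied by the approximate daily counts in the text.
# Assumption: requests are spread evenly over a 24-hour day; real traffic
# would be burstier, so peak rates would exceed these averages.

SECONDS_PER_DAY = 24 * 60 * 60

DAILY_COUNTS = {
    "file creations and deletions": 10_000,
    "file access requests": 25_000,
}


def mean_interval_seconds(count_per_day):
    """Average time between requests, in seconds."""
    return SECONDS_PER_DAY / count_per_day


for name, count in DAILY_COUNTS.items():
    print(f"{name}: one every {mean_interval_seconds(count):.2f} s on average")
```

Even allowing for bursts well above the average, a request every few seconds is a modest control load, which is consistent with the expectation that standard components would suffice.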

The analysis assumed a file configuration with both high-speed
storage and lower-speed mass storage on-line. In particular, more
than 10^^ bits of high-speed disk storage (25 msec average access,
3.6 MByte/sec transfer rate) was planned. More than 2 x 10^2 bits
of mass storage (3 seconds average access, 1 MByte/sec transfer
rate) was also planned. Although these appear more than adequate,
the utilization studies described below have not yet considered
what file capacities would be required given the scenarios
supplied by NASA late in the period of the study.

2.3  SYSTEM  UTILIZATION  STUDIES 

The ability of the NASF system organization to support the
expected workloads was evaluated. The
evaluation is summarized below and discussed in more detail in
Appendix F. In summary, the system organization described above
would be capable of supporting the workload hypothesized by NASA
[4].

The NASF Utilization document [4] provided by NASA described the
use of the facility in terms of classes of usage, called Cases, and
in terms of the sequence of Tasks performed for each job. The
Cases (such as method and code development or design simulations)
and Tasks are summarized in Appendix F. Before confidence could
be gained that the system could support the projected load, the
commitment of each system-level resource to each task was
carefully charted. These event sequence charts identify the sequence
of events needed to implement each task. The charts also identify
those system resources (File System, Data Comm, Support Processor,
FMP, ...) which must interact to implement each event. Samples of
these charts are also included in Appendix F.

The charts were then used as a model to develop a program which
was used to analyze the impact of the hypothetical workload on the
various components of the system. Some of the data used in the
analysis program was based on a benchmark of a mix of FORTRAN
programs on a B7700. A known factor of improvement was used to
project expected B7800 performance. In addition, a processor
which might be available about the time that the NASF project
would be implemented was hypothesized and used for evaluation
purposes.

Table 2.1 summarizes the results of the analysis of the Support
Processor loading with and without the COM formatting discussed
earlier.


TABLE 2.1

Support Processor CPU Hours Needed/Hour
(Averaged over Day)

Processor               With COM    Without COM

Similar to B7700          14.2          1.3
Similar to B7800           9.5          0.9
"Future Processor"         2.8          0.2

In Table 2.1 note that a support processor implemented with the
future processors expected to be available to the NASF project
could support the COM workload with a reasonable-sized system (3-4
central processors).
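
The processor counts follow from the Table 2.1 loads by rounding up. The sketch below shows one way to do that arithmetic; the `max_utilization` parameter is a hypothetical planning margin introduced here for illustration and is not a figure from the study.

```python
import math

# CPU hours needed per wall-clock hour, from Table 2.1 (averaged over a day).
CPU_HOURS_PER_HOUR = {
    "Similar to B7700": {"with_com": 14.2, "without_com": 1.3},
    "Similar to B7800": {"with_com": 9.5, "without_com": 0.9},
    "Future Processor": {"with_com": 2.8, "without_com": 0.2},
}


def min_processors(load, max_utilization=1.0):
    """Smallest whole number of processors that can carry `load`.

    `max_utilization` is a hypothetical planning margin: 1.0 means each
    processor may be kept fully busy, 0.75 leaves 25% headroom.
    """
    return math.ceil(load / max_utilization)


for name, loads in CPU_HOURS_PER_HOUR.items():
    print(name,
          "| with COM:", min_processors(loads["with_com"]),
          "| without COM:", min_processors(loads["without_com"]))
```

With the "Future Processor" load of 2.8, this gives three processors at full utilization and four if each is held to 75%, which brackets the 3-4 range quoted above.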

Table 2.2 summarizes the Data Transfer Requirements averaged over
the day and by shift. Note that these data transfer rates only
show the average rates, not the peak rates needed to prevent the
data path from being a bottleneck. The daily average is over a
full 24-hour day. The data rate (char/sec) assumes 8-bit
characters plus error control.


Table 2.3 summarizes the File System control activity by day. The
terms ACTIVE, LONGTERM, and ARCHIVE in the table indicate the
different types of files expected to be found in the File System.
Active files are those only recently created or actively used and
would be on the devices with the fastest access times. Longterm
files are those which have been in the active system for up to a
week with little or no use before being copied onto a slower
medium. Some files are saved on on-line mass storage, called the
Archive in the table. These files would have an access time on
the order of seconds but would still be on-line.


TABLE 2.3

NASF File System Control Activity per Day

                            FILE TYPE
FILE ACTIVITY       ACTIVE    LONGTERM    ARCHIVE

Files Created         2483      1127        627.3
Files Deleted         2483      1127        627.3
Files Accessed       19810       827.7      118.3
Files Replaced        1302        --          --

The analysis performed to date is sufficient to give one
confidence that the system studied would be capable of supporting the
hypothesized workload. Before design can begin, more detailed
studies should be performed to determine more accurate estimates
of grid generation task requirements, the impact of interactive
graphics support tasks and the sensitivity of system support to
all parts of the hypothesized workload.

CHAPTER 3


APPLICATION  ANALYSIS 


3.1  INTRODUCTION 

The requirement for performance for the FMP was initially stated
as the execution of the "typical" 3D Navier-Stokes aero flow code
on 200 x 50 x 100 grids in 10 minutes, with the provision that the
FMP should be a flexibly programmable machine that can run a
number of similar applications with similar throughput. These
throughput goals can be restated, with respect to the sample aero
flow codes supplied by NASA, in terms of a more hardware-related
secondary standard of performance: that the FMP should be capable
of achieving a sustained rate of 1.0 Gflops/sec on aero flow codes
that take advantage of its architecture. These goals were met, as
described in more detail below.

3.2  PRODUCTION  APPLICATIONS 

The statement of work specifically asked for a design that is
adapted to the requirements of computational aerodynamic
programming, with a secondary look at the requirements of weather
computations. NASA supplied two examples of aerodynamic flow
codes, identified as the "3D explicit" code and the "3D implicit"
code. In addition, two programs exemplifying the weather
applications were supplied, one being a Goddard Institute for Space
Studies (GISS) version of the Mintz-Arakawa global circulation
model, the other one being a spectral weather code from MIT
(Spectral).

3.2.1  Functional  Requirements 

The application areas of interest, as exemplified by the codes
supplied, represent a substantially different spectrum of
applications than one would arrive at by questioning all of the users
of very high speed numerical computation.

A general purpose very high speed numerical computing machine must
support a wide variety of precision requirements. For example,
users with sparse and ill-conditioned matrices, such as one finds
in some structures applications, require very high precision, for
some users well over 30 decimal digits. Aero flow and weather
codes apparently will run happily with not more than 10 or 12
decimal places of precision, with much of the computation and most
of the data needing only six or seven places of precision.

It has been appreciated for two decades that the speed of light
puts an upper limit to the throughput of a uniprocessor, and that
very high-speed machines must use some sort of parallelism or
concurrency in order to achieve throughput. Traditionally,
parallelism has taken two forms: first, vector machines in which only
data with extreme regularity could be processed in parallel, and
second, multiprocessors in which many totally independent programs
run in parallel. Because aero flow and weather codes can be
vectorized, a vector machine could be made to work. However, the
vectorization imposes inefficiencies (for example, subroutine
CHARAC in the 3D explicit, or COMP3 out of the GISS weather). As
a result a machine that is efficient only for vectors is often not
efficient when considering all of the programs that one expects
that computational fluid dynamicists would want to write.

Hence, part of the problem is to demonstrate the feasibility of a
flow model processor that is as efficient on vectors as the
traditional vector machine, and is also efficient when the concurrent
processing is on data that does not form vectors. Furthermore,
the language should allow for the description of parallel (or
vector) operations and for concurrent scalar processes which are
independent of each other, or for any mixture of the two.

Demonstration of optimum feasibility of the FMP for its
application set therefore includes:

- Provision of concurrency (or parallelism) for high throughput
without the requirement for vectorization of the algorithm.
Although the implicit algorithm is easily vectorized, and the 3D
explicit is also easily vectorized, the earlier 2D explicit was
not all easily vectorizable, and a large portion of weather
(subroutine COMP3) can be vectorized only with difficulty and a large
penalty in throughput.

- A language (FMP FORTRAN) in which one can write either
non-vectorized concurrent operations, or vector operations.

- Word size. The computational fluid dynamics and weather
community requires no more than 10 or 12 decimal digits of
precision, corresponding to 33 or 36 bits in the fraction part of the
floating point word, with some computation and most data requiring
no more than six or seven digits (24 bit fraction part) of
precision. This is not true of other "typical" users of very
high-speed numeric processing. Requirements for precision run from 8
bits, for picture processing, to over 30 decimal digits, for users
with large, sparse, ill-conditioned matrices, typically in structures
applications. A large number of scientific processor users
desire 14 to 16 decimal digits of precision.

- A language based on FORTRAN to accommodate the applications
programmers, who, in the computational fluid dynamics and weather
communities, have mostly been used to working in one dialect or
another of FORTRAN.

3.2.2 Projected Performance Summary

For the aero flow codes, the FMP here described would run the 3D
implicit in 6 minutes and 16 seconds (100 time steps) at a
throughput rate of 1.01 Gflops/sec* during that time. The 3D
explicit runs in 8 minutes and 52 seconds (again, for a test case
with 100 time steps) at a throughput rate of 0.89 Gflops/sec
during that time. In both cases, the mesh had a million grid
points (100 x 50 x 200 in the case of the implicit, 100 x 100 x
100 for the explicit). Feasibility is therefore demonstrated.
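
The run times and throughput rates above also imply how much arithmetic each code performs per grid point. The following sketch works that out; it assumes the quoted Gflops/sec figure is the average over the whole run, and the function and variable names are illustrative rather than taken from the study.

```python
# Work implied by the projected run times and throughput rates in the text.
# Assumption: the quoted Gflops/sec figure is the average over the whole run.

RUNS = {
    # name: (minutes, seconds, Gflops/sec, grid dimensions, time steps)
    "3D implicit": (6, 16, 1.01, (100, 50, 200), 100),
    "3D explicit": (8, 52, 0.89, (100, 100, 100), 100),
}


def flops_per_point_step(minutes, seconds, gflops, grid, steps):
    """Floating point operations per grid point per time step."""
    run_seconds = 60 * minutes + seconds
    total_flops = gflops * 1e9 * run_seconds
    points = grid[0] * grid[1] * grid[2]
    return total_flops / (points * steps)


for name, run in RUNS.items():
    print(f"{name}: about {flops_per_point_step(*run):.0f} "
          "operations per grid point per time step")
```

On these figures the implicit code works out to roughly 3,800 operations per grid point per time step and the explicit code to roughly 4,700.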

Other metrics can be used to describe the "raw" throughput, of
which the above is the net:

2.22 Gflops/sec would be the maximum throughput rate given that
operations are alternately add and multiply.

1.74 Gflops/sec would be the maximum throughput rate for
register-to-register operations using the instruction mix derived from
analysis of the aero flow codes.

1.33 Gflops/sec would be the throughput rate seen in about half of
the sequences of code submitted to the simulator.

Of the above, the figure of 1.33 Gflops/sec represents a
throughput rate achieved by a number of real sequences of code, taken
from both aero flow codes and from weather. It represents an
achievable throughput for "friendly" applications.
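
One way to relate the net figures to these raw rates is as simple ratios. The sketch below computes them from the numbers quoted in this section; the labels are shorthand introduced here, and treating the ratio as an "efficiency" is an interpretation, not a metric defined by the study.

```python
# Projected net throughput as a fraction of the "raw" rates quoted above.
RAW_GFLOPS = {
    "alternating add/multiply maximum": 2.22,
    "register-to-register, aero flow mix": 1.74,
    "rate seen in about half of simulated sequences": 1.33,
}
NET_GFLOPS = {"3D implicit": 1.01, "3D explicit": 0.89}


def fraction_of(net, raw):
    """Net throughput divided by a raw reference rate."""
    return net / raw


for code, net in NET_GFLOPS.items():
    for label, raw in RAW_GFLOPS.items():
        print(f"{code}: {fraction_of(net, raw):.0%} of {label}")
```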

All of the above refers just to the FMP. The throughput of the
NASF is just as much dependent on proper function of the Support
Processing System (SPS) as it is on the FMP. The SPS, however,
presents well-known problems, not problems unique to the
particular set of applications.

3.3  PERFORMANCE  PROJECTION  BASED  ON  BENCHMARK  PROGRAMS 
3.3.1  Summary 

The  four  programs  used  as  benchmarks  in  evaluating  the  design 
were: 


- NASA  3D  implicit  aero  flow  code  supplied  by  Ames 

- NASA  3D  explicit  aero  flow  code  supplied  by  Ames 

- GISS  weather  code,  in  several  different  versions 

- Spectral  weather  code  from  MIT 

Evaluations of the first three were comprehensive, resulting in
the projections of 1.01 Gflops/sec and 6 minutes, 16 seconds, for
the implicit, 0.89 Gflops/sec and 8 minutes, 52 seconds for the
explicit at the size of the benchmark, and 0.53 Gflops/sec and 4
minutes, 25 seconds for the GISS weather code. Appendix A
discusses these evaluations in detail. These evaluations and the
conditions leading to these conclusions are summarized in this
chapter.

*Gflops/sec: billion floating point operations per second.

The implicit code achieves the 1.0 Gflops/sec being used as a
guide for evaluating adequate throughput rate. The explicit code
nearly does. Since the intent of the explicit is to be
computationally more efficient than the implicit, the performance goals
are deemed demonstrated.

On GISS weather, the non-vectorizable portions of the code
executed at more than one Gflops/sec (subroutine COMP3), while the
throughput rate observed in vectorizable portions (COMP1 and
COMP2) was reduced by EM accessing and memory-to-memory moves that
produced no floating point operations.

Examination of the spectral weather shows that the fluid dynamics
portion should run with higher flop rate than the fluid dynamics
portion of the GISS weather (COMP1 and COMP2), and that the
chemistry and physics portions were essentially identical to COMP3.

Hence, the spectral weather is expected to run at a higher flop
rate than the GISS weather.

3.3.2  Method 

The  method  used  for  performance  evaluation  was  generally  the  same 
for  all  of  the  first  three  benchmark  programs.  Because  of  time 
and  budget  limitations,  only  a cursory  look  was  taken  at  the 
Spectral  weather  code. 

Throughput was analyzed on the basis of FMP computations. I/O
operations were ignored. Transfers between DBM and file system
are independent of, and go in parallel with, the FMP computation.
It is assumed that the file manager stages the next job, and
unloads the last job, in times which are completely overlapped
with current computation. DBM-EM transfers are also ignored,
since they go on concurrently with current processing as long as
EM space is available and take negligible time. The 15 million
words of a restart point of a typical aero flow code are loaded in
0.375 seconds, which can be compared with the 600 seconds duration
of a typical run. Therefore, both system I/O and user I/O were
ignored.
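
The claim that the restart load is negligible can be checked directly from the figures in this paragraph. This is a small sketch of that arithmetic; the names are illustrative, and the "overhead" computed is the worst case in which the load is not overlapped with computation at all.

```python
# Restart-point load time compared with a typical run, using the text's figures.
RESTART_WORDS = 15_000_000   # words in a typical restart point
LOAD_SECONDS = 0.375         # time to load them
RUN_SECONDS = 600            # duration of a typical run


def transfer_rate_words_per_sec(words, seconds):
    """Transfer rate implied by loading `words` in `seconds`."""
    return words / seconds


def overhead_fraction(load_seconds, run_seconds):
    """Load time as a fraction of run time, if it were not overlapped."""
    return load_seconds / run_seconds


rate = transfer_rate_words_per_sec(RESTART_WORDS, LOAD_SECONDS)
share = overhead_fraction(LOAD_SECONDS, RUN_SECONDS)
print(f"Implied rate: {rate / 1e6:.0f} million words/sec")
print(f"Load time is {share:.4%} of the run")
```

The implied rate is 40 million words per second, and the load amounts to well under a tenth of a percent of the run even without overlap.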

Each program was analyzed to find the calling tree of the
subroutines, and subroutines were divided into sequences of code that
were internally similar. Analysis was performed on each
individual sequence and the results combined, taking into account
processor utilization percentages and number of exceptions, into total
figures for each of the benchmark programs. Thus the analysis
included every line of code in the first three programs.


Analysis consisted of hand compilation and simulation for a
selected number of code sequences, and estimates based on interpolation
between known simulations for the rest of the sequences. In the
implicit aero flow code, over 60% of the computations of the
program are within the inner loop which was simulated. For those
sequences which were not simulated, a formula was developed which
interpolated between the simulated sequences. For a more detailed
description of this method, see Appendix A. Various cases in
which exception should be taken to the formula were also taken
into account. It was found that almost all of the simulation
results could be empirically fit by a formula of the form

T = (k_1 F + k_2 M + k_3 D) / P          (3.1)

where T is the time required to execute a particular code segment,
F is the number of floating point operations in that segment, M is
the number of EM accesses, D is the number of divisions over and
above the 2% divides assumed by the "standard" instruction mix,
and P is the number of processors processing. The constants k_1
and k_2 are determined empirically from the simulation results,
and k_3 was set equal to the time of a divide instruction.
Throughput for the individual code segment is given by F/T.
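
As an illustration only, formula (3.1) and the F/T throughput figure
can be written as two small Python functions. The function names and
the numeric constants below are assumptions made for this sketch; the
report's fitted values of k_1, k_2 and k_3 are not given in this
section.

```python
def segment_time(f_ops, em_accesses, extra_divides, processors, k1, k2, k3):
    """Formula (3.1): T = (k1*F + k2*M + k3*D) / P, in seconds."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return (k1 * f_ops + k2 * em_accesses + k3 * extra_divides) / processors


def segment_throughput(f_ops, em_accesses, extra_divides, processors, k1, k2, k3):
    """Throughput F/T of one code segment, in floating point operations per second."""
    t = segment_time(f_ops, em_accesses, extra_divides, processors, k1, k2, k3)
    return f_ops / t


if __name__ == "__main__":
    # Placeholder constants (seconds per operation); NOT values from the report.
    K1, K2, K3 = 4.0e-7, 2.0e-6, 3.0e-6
    rate = segment_throughput(1_000_000, 50_000, 1_000, 512, K1, K2, K3)
    print(f"illustrative throughput: {rate / 1e9:.2f} Gflops/sec")
```

With real fitted constants, the same two calls would reproduce the
per-segment estimates that were summed into the program totals.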

The hand compilation made certain assumptions about the compiler.
Assignment of instances of the DOALL to processors was not
optimized, but done by the simplest algorithm conceivable (see
Chapter 4 for software discussions). Optimization steps such as
the substitution of an add or subtract from exponent to replace a
multiplication or division by a power of 2 were assumed.

3.3.3  Throughput  of  Aero  Flow  Codes 

The implicit aero flow code, for which simulation covered over 60%
of the computations, was estimated to have a throughput rate of
1.01 Gflops/sec at the 100 x 50 x 200 size and ran 100 time steps
in 6 min. 16 sec. The implicit code showed an estimated throughput
rate of 0.89 Gflops/sec at the 100 x 100 x 100 mesh size and ran
100 time steps in 8 min. 52 sec. Details are in Appendix A.

The  language  being  considered,  FMP  FORTRAN  (described  in  more 
detail  in  Chapter  4),  was  found  to  fit  the  aero  flow  codes  very 
conveniently.  A simple,  one-to-one  translation  from  FORTRAN  codes 
provided  into  FMP  FORTRAN  goes  as  follows.  All  arrays  subscripted 
with  the  grid  variables  are  made  GLOBAL.  DO  loops  (single  or 
nested)  on  the  grid  variables  are  automatically  turned  into  DOALLs 
as  long  as  the  data  dependence  allows  it.  Temporary  variables  are 
allowed  to  be  LOCAL  by  default.  The  implicit  code,  as  supplied  by 
NASA,  is  of  such  regularity  that  practically  all  of  it  can  be 
transformed  into  FMP  FORTRAN  using  such  simple  rules.  Because  of 
this, and in order to save time, most of the FMP FORTRAN versions
of the aero codes were not even written down, since they are
obvious by inspection from the FORTRAN versions provided.

During the hand compilation process it was found that translation
from FMP FORTRAN to FMP machine code was simple and
straightforward. This gives confidence that the compiler will be
relatively simple to write.

Processor utilization ranged from 93% (in the explicit) to an
average of 97.4% (in the implicit). Some routines gave 99.9%
processor utilization. All subroutines and other code sequences
were included in the total time and total number of floating point
operations. In neither aero flow code did any of those sequences
with low processor utilization have any influence on the final
throughput estimate.

3.3.4  Weather  and  Climate  Codes 

Two benchmark programs were supplied by NASA Ames for use in
evaluating the performance of the FMP for weather and climate
codes. The first, a Goddard Institute for Space Studies (GISS)
version of the Mintz-Arakawa global circulation model, came in
several different versions written for several different machines.
These various versions are seen to have variations in portions of
the algorithm. The version analyzed was one written for the
360/195. This is the same version that had previously been used as
a test case for analyzing BSP throughput.

The second is a "spectral" weather code from MIT, in which an FFT
is used to regularize the hydrodynamical computations.

The GISS code was analyzed at an intermediate grid size (2°
latitude steps, 2.5° longitude increments along the equator, with
20 minute time steps). The program consists of an easily
vectorizable fluid dynamics section (subroutines COMP1 and COMP2
and the subroutines they in turn call), and COMP3 and its callees,
the physics and chemistry section. The average throughput rate for
the entire program was determined to be 0.532 Gflops/sec with a
14-day simulation taking 4 minutes, 25 seconds of FMP time.

The  GISS  climate  code  demonstrated  the  advantages  of  the  FMP 
architecture. The vectorizable portions tended to run slow
because  of  many  EM  accesses,  but  COMP3  and  its  subroutines  ran  as 
independent  scalar  processes  in  parallel  in  all  the  processors, 
achieving  over  1.2  Gflops/sec  for  the  portion  simulated.  COMP3 
and  its  subroutines  have  been  shown  to  be  hard  to  vectorize  for 
existing  vector  machines,  whereas  it  is  not  necessary  to  vectorize 
them  for  the  FMP. 

In this benchmark, substantial use is made of parts of the
language that see little or no use in the two aero flow codes,
including:

- Domain definitions using domain expressions that
  include previously defined domains

- INALL arrays

- Initialization of values in declaration by arithmetic
  expressions

- NEXTDO

- Branching within DOALLs


This is a worst-case analysis, in that any data dependent branches
were assumed to demand the most computations. This approach was
used in order to estimate the maximum running time of the GISS
climate code. Other conditions which simplify the radiation
calculations (such as the existence of cloud cover) will result in
fewer floating point operations, and shorter times. Whether the
Gflops/sec rate would go up or down under these conditions depends
on whether flops or elapsed time is reduced proportionately more.
This case was not analyzed.

The spectral weather code is expected to run with substantially
higher throughput than the GISS climate code does. Its fluid
dynamics portions are done by spectral analysis, with each
processor processing an FFT independently of all other processors.
Thus, the fluid dynamics computations are much more locally
contained, since all the intermediate results in the FFT can be
contained within processor memory (declared either INALL or LOCAL),
and should run faster. The chemistry and physics portions of the
spectral weather code are substantially identical to the chemistry
and physics portions of the GISS climate code, and the analysis of
one can represent the analysis of the other.

3.3.5 Applications Beyond the Benchmarks

The analysis summarized earlier in this chapter and in Appendix A
demonstrates the applicability of the FMP described in Chapter 5
to aero flow and weather codes. This analysis is therefore a
constructive demonstration of the feasibility of the NASF. The
FMP as described has broader applicability than to applications
similar to the four benchmarks, as the remarks in this section
will indicate. The following are considered:

- Single FFT (in the spectral weather code the 512 processors
  do 512 FFTs concurrently)

- Sort

- Problems too large to fit in Extended Memory (in "core")




In Section A.6 of Appendix A, the FFT is discussed. That section
shows that the FMP runs various sizes of FFT at throughputs
varying from 0.5 through 0.7 Gflops/sec. The reduction from 1.3
Gflops/sec is due to data rearrangement.

Section A.6 also discusses a method for achieving concurrency in
sorting keys or data elements that are contained in processor
memory. The particular method shown runs at 100% processor
utilization when sorting an array of elements that start out being
sorted in inverse order, such a case being a kind of worst-case
test for some sorting algorithms.

3.3.5.1 Large Problems

The "standard" scenario for the use of the FMP is that all files
necessary for the use of an FMP task are in place within DBM at
the time the task is started. During the course of a run, the
task is essentially self-contained within the "main memory",
namely EM, CM and PM. This does not preclude reading from DBM to
EM, or writing to DBM from EM at appropriate times. A set of
files is located in DBM at the start of the run. Files may be
created within DBM during the course of the run, snapshot dumps
for instance, and when such files are closed by the FMP program,
the file system has the option of moving them out of DBM before
the run is finished. The concept of having the high-speed
computations contained within a bounded portion of the hardware,
here the FMP, with no interaction with external devices such as
the support processor, has been given the name "computational
envelope". However, the computational envelope is not completely
sealed even during the "standard" scenario.

Another scenario is the running of tasks that will not fit in main
memory. The following questions are considered. First, what
facilities should be available with the initially delivered
compiler; second, what facilities are envisioned for possible later
implementation; and third, what problem properties allow efficient
operation for problems that do not fit in memory.

The  system  evaluated  during  this  study  does  not  have  the  system 
software  required  for  automatic  virtual  memory  management  for 
taking  care  of  overflow  from  EM.  Hence,  the  user  programmer  will 
have  to  insert  READ  and  WRITE  statements  for  access  to  DBM  files. 
As with any other direct I/O scheme, it behooves the programmer to
initiate I/O ahead of time, and test for completion at the point
of using it. Only one direct I/O operation can be going on at one
time.  If  a second  direct  I/O  is  called  for  before  the  first  has 
finished,  the  program  would  wait  for  the  first  I/O  operation  to 
finish  before  initiating  the  second.  User  processing  would  be 
suspended  until  that  second  direct  I/O  is  started. 
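
The effect of the one-operation-at-a-time rule on a user program
can be sketched with a small timeline model. This is a hypothetical
illustration, not part of the FMP software: the function name, the
(compute, I/O) step representation, and the timings are all
assumptions made for the sketch.

```python
def simulate_direct_io(steps):
    """Model a program that alternates computation with direct I/O requests.

    steps: list of (compute_seconds, io_seconds) pairs in program order.
    Each I/O is initiated after its compute phase and runs in the background.
    Only one direct I/O may be in progress, so a new request stalls the
    program until the previous transfer has finished.
    Returns (finish_time, total_stall_time).
    """
    now = 0.0            # program clock
    io_busy_until = 0.0  # time at which the outstanding transfer completes
    total_stall = 0.0
    for compute, io in steps:
        now += compute
        if io_busy_until > now:
            # Previous transfer still running: processing is suspended.
            total_stall += io_busy_until - now
            now = io_busy_until
        io_busy_until = now + io
    return max(now, io_busy_until), total_stall


if __name__ == "__main__":
    # Transfers short enough to hide behind computation: no stall.
    print(simulate_direct_io([(1.0, 0.5), (1.0, 0.5)]))
    # Second request issued while the first is still running: 1.5 s stall.
    print(simulate_direct_io([(1.0, 2.0), (0.5, 1.0)]))
```

The second case shows why I/O should be initiated well ahead of the
point of use: the stall disappears once the compute phase between
requests is at least as long as the outstanding transfer.
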

The FMP hardware, which is described in more detail in Chapter 5,
is intended to be able to support a virtual memory mechanism
whereby certain Extended Memory (EM) files can be held in Data
Base Memory (DBM) when EM does not have enough space. EM space
would be dynamically allocated, and addressed with base registers;
hence one possible implementation of such virtual memory is to
have the base addresses of non-present data point outside of
actual memory space. The address-out-of-bounds interrupt would
trigger transfers between DBM and EM, plus some processing to fix
up base addresses.

The  system  considered  during  this  study  does  not  have  the  system 
software  to  implement  such  a virtual  memory  scheme.  This  lack  of 
automatic  virtual  memory  management  did  not  impact  the  throughput 
studies  of  the  aero  codes  and  GISS  codes  since  these  benchmarks 
will  be  able  to  reside  within  the  planned  EM  space. 

Hardware  mechanisms  that  allow  virtual  memory  for  Processor  Memory 
(PM)  space  using  EM  memory  space  to  back  up  the  PM  space  should 
also  be  planned.  Methods  for  supplying  this  feature  are  still 
under discussion. One suggestion is that EM module No. P could be
assigned to processor No. P, giving each processor its own private
EM module for back-up for virtual memory purposes.

Some of the characteristics of an aero flow code that would
execute satisfactorily, even though it would not fit in the
Extended Memory (EM), can be determined by analysis independent of
the method used to extend the storage capacity for problems into
the Data Base Memory (DBM).

The  following  discussion  is  based  on  the  3D  implicit  aero  flow 
code,  whose  major  computational  effort  is  in  subroutine  STEP  and 
its  subroutine  BTRI.  A listing  of  BTRI  in  FMP  FORTRAN  is  included 
in  Appendix  H. 

The  bandwidth  between  EM  and  DBM  (detailed  in  Chapter  5)  is  50 
million  words  per  second.  Since  data  overlay  requires  moving  idle 
data  out  to  make  room  for  the  new  data,  half  of  this,  or  up  to  25 
million  words  per  second,  is  the  rate  that  files  can  be  brought  in 
to be worked on. If the throughput rate (1.0 Gflops/sec) is to be
maintained,  there  must  be  40  or  more  floating  point  operations  for 
every  word  brought  into  the  EM  from  DBM.  This  goal  can  be  met  and 
then  some.  As  an  example,  consider  the  following  demonstration  of 
one  way  of  programming  a 220  x 220  x 220  3D  implicit  aero  flow 
program. 

The data is blocked into 64 blocks, each 55 x 55 x 55. The
organization of these blocks is shown in Fig. 3.1. At any given
time, four blocks forming a "pencil" in one direction will be in EM.
Computation  sweeps  from  one  end  of  the  pencil  to  the  other  and 
back  again,  so  that  having  anything  less  than  a pencil  in  EM  will 
increase  the  amount  of  overlays  between  EM  and  DBM  dramatically. 
Analysis  of  subroutines  STEP  and  BTRI  shows  that  there  are  about 
84  floating  point  operations  on  each  datum,  larger  than  the  40 
required  for  the  desired  throughput. 
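
The 40-operations-per-word requirement follows directly from the
figures quoted above, as the short calculation below shows. The
function name and the inbound_fraction parameter are illustrative
assumptions; the bandwidth, throughput and 84-operation figures are
taken from the text.

```python
def min_flops_per_word(throughput_flops, em_dbm_words_per_sec, inbound_fraction=0.5):
    """Floating point operations needed per word brought into EM so that
    EM-DBM traffic keeps pace with computation.

    inbound_fraction is the share of the EM-DBM bandwidth available for
    bringing data in; the text assumes half, since idle data must also
    be moved out.
    """
    inbound_rate = em_dbm_words_per_sec * inbound_fraction
    return throughput_flops / inbound_rate


if __name__ == "__main__":
    required = min_flops_per_word(1.0e9, 50.0e6)  # 1.0 Gflops/sec, 50 Mwords/sec
    available = 84                                # ops per datum in STEP and BTRI
    print(required)               # 40.0
    print(available >= required)  # True
```
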

Memory requirements are dominated by the pencils. It is
conventional in such overlaying situations to have one pencil in
"core" being computed on, one pencil's worth of space allocated to
the next pencil being brought in, and one pencil's worth of space
allocated to the newly created data that is being written out.
Since only one transfer is going on at a given time, in the
present instance it may be possible to use the space vacated by
newly created data to contain the next pencil, so that only two
pencils' worth of space are needed. Assuming 15 variables per
point, and 665,500 mesh points per pencil, three pencils (the worst
case) would occupy 29,947,500 words in EM; two pencils (the more
likely case) would occupy 19,965,000 words in EM.
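
The pencil sizes quoted above can be reproduced from the block
dimensions. The sketch below is only a check of the arithmetic; the
function name and its default arguments are assumptions chosen to
match the 220 x 220 x 220 example.

```python
def pencil_size(block_edge=55, blocks_per_pencil=4, variables_per_point=15):
    """Return (mesh points, EM words) for one pencil of cubic blocks."""
    points = blocks_per_pencil * block_edge ** 3
    return points, points * variables_per_point


if __name__ == "__main__":
    points, words = pencil_size()
    print(points)      # 665500 mesh points per pencil
    print(3 * words)   # 29947500 words for three pencils (worst case)
    print(2 * words)   # 19965000 words for two pencils (more likely case)
```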

Although the EM-DBM data transfers are completely hidden behind
computation, and do not slow down the throughput, there will be a
throughput reduction from the 1.01 Gflops/sec analyzed in Appendix
A from another cause. Not all the arrays declared LOCAL in the 3D
implicit code of the analysis will fit in processor memory. Some
of these arrays will have to be held in EM, where the access time
is longer. Alternatively, recomputation can be used to avoid the
saving of precomputed results.

After  sweeping  16  such  pencils  in  one  direction,  direction  is 
switched  and  16  pencils  are  swept  in  the  second  direction,  and 
then  in  the  third. 

Virtual  memory  machines  have  been  on  the  market  for  19  years  at 
least; the Burroughs B5000 is an early example. All of the
commercially available virtual memory mechanisms show varying
degrees of
throughput  reduction  when  the  data  base  for  the  problem  is  larger 
than  the  main  memory  of  the  machine.  When  the  programmer  controls 
his  own  direct  I/O,  there  is  the  opportunity  for  favorable  cases, 
such  as  the  implicit  aero  flow  above,  to  achieve  full  machine 
throughput  on  problems  too  large  to  fit  in  main  memory. 

3.3.6  Application  Domain 

The primary area of application of the FMP, according to the
statement of work, will be the aero flow codes. A secondary area of
application  will  be  the  weather  and  climate  codes.  Analysis  of 
the  benchmark  programs  shows  that  for  reasonable  grid  sizes,  the 
desired  throughput  is  achieved.  The  range  of  problem  sizes  for 
which  the  throughput  applies  is  analyzed  here,  as  well  as  what  is 
the  largest  problem  that  will  fit  in  the  DBM  using  the  approach 
described  in  the  previous  Section  3.3.5. 

In the aero flow code, the smallest "good" grid size is that which
permits two dimensional DOALLs to run with reasonable efficiency.
Hence, the smallest grid has a single dimension of not much
smaller than √512. A grid of 22 x 22 x 22 is the smallest that
runs with 94% processor utilization or better. The largest aero
flow code is the largest one whose data base will fit in EM.
Assuming fifteen variables per grid point, and 2^25 words in the EM
address space, that is a grid of about 2.2 x 10^6 grid points.
Other EM space requirements reduce the figure. The largest
program that will fit in EM and DBM using direct I/O has a complete
data base allocated to DBM. At the currently specified size of
DBM, namely 2^27 words, this is an upper limit of about 9 x 10^6
grid points. If a larger upper limit were required, the size of
DBM could easily be increased.
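
The grid-point limits can be checked with integer arithmetic,
assuming the EM and DBM sizes are 2^25 and 2^27 words respectively
(the sizes that the quoted limits of about 2.2 and 9 million points
imply). The constant and function names are illustrative only.

```python
EM_WORDS = 2 ** 25    # assumed EM address space, in words
DBM_WORDS = 2 ** 27   # assumed DBM size, in words


def max_grid_points(memory_words, variables_per_point=15):
    """Largest number of grid points whose variables fit in memory_words."""
    return memory_words // variables_per_point


if __name__ == "__main__":
    print(max_grid_points(EM_WORDS))    # 2236962, about 2.2 million
    print(max_grid_points(DBM_WORDS))   # 8947848, about 9 million
```

These are upper bounds; as the text notes, other EM space
requirements reduce the usable figure.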

In  the  weather  and  climate  codes,  the  grid  has  a much  smaller 
dimension  in  height  than  it  does  in  latitude  or  longitude.  Not 
all  weather  code  grids  are  in  terms  of  longitude  and  latitude; 
other  two-dimensional  grids  can  be  mapped  onto  the  surface  of  the 
earth as well. In the GISS climate code, as translated for the
FMP, almost all of the DOALLs are on a single layer. Thus, as
long as that layer is nearly 512 elements, the bulk of the
computations will be done with good processor utilization. The
smallest "good" weather problem would be one with 15° longitude
spacing along the equator, a grid of 20 x 24 in each layer.

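The relation between grid size and processor utilization can be
sketched as follows, assuming the simplest assignment rule: each of
the 512 processors takes one DOALL instance per cycle, and any
processors left over in the final cycle sit idle. The function names
are hypothetical.

```python
import math


def doall_cycles(instances, processors=512):
    """Number of passes needed to execute all DOALL instances."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    return math.ceil(instances / processors)


def doall_utilization(instances, processors=512):
    """Fraction of processor slots doing useful work over all cycles."""
    return instances / (doall_cycles(instances, processors) * processors)


if __name__ == "__main__":
    print(doall_utilization(22 * 22))   # 0.9453125 (aero flow, 22 x 22 plane)
    print(doall_utilization(20 * 24))   # 0.9375    (weather, 20 x 24 layer)
    print(doall_cycles(89 * 144))       # 26 cycles for the analyzed GISS grid
```

Under this model a 21 x 21 plane drops to about 86% utilization,
which is consistent with 22 x 22 being named as the smallest grid
that reaches 94%.
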
Subroutine AVRX of the GISS code represents a non-negligible
portion of the computation. Appendix A describes five different
ways that AVRX may be mapped onto the FMP. Any one of the five
ways will work, but all have some drawback. The throughput of
AVRX will be poorer at the smaller grid sizes, and the preferred
implementation may vary as a function of grid size. Hence, the
throughput estimates of Appendix A (0.53 Gflops/sec) will have to
be revised somewhat for different grid sizes to take into account
the effects of AVRX*. At the grid size of 89 x 144 analyzed in
Appendix A, AVRX was 2.2% of the running time, executing at 0.065
Gflops/sec.

The  largest  weather  code  that  will  fit  in  memory,  assuming  16 
variables  per  grid  point,  would  be  a grid  size  of  about  432  x 268 
x 15  levels,  or  roughly  three  times  the  resolution  of  the  case 
explored.  Alternatively,  a grid  size  of  512  x 320  x 12  would  also 
fit.  As  with  the  aero  flow  codes,  a grid  with  four  times  as  many 
points  will  fit  into  DBM,  say  864  x 536  x 15  or  1024  x 640  x 12. 
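
A quick check that the quoted weather grids fit is sketched below.
It assumes 16 variables per grid point as stated, and EM and DBM
sizes of 2^25 and 2^27 words; the function name is illustrative.

```python
EM_WORDS = 2 ** 25    # assumed EM size, in words
DBM_WORDS = 2 ** 27   # assumed DBM size, in words


def weather_words(nx, ny, levels, variables_per_point=16):
    """Words of storage needed for an nx x ny x levels weather grid."""
    return nx * ny * levels * variables_per_point


if __name__ == "__main__":
    for grid in [(432, 268, 15), (512, 320, 12)]:
        print(grid, weather_words(*grid), weather_words(*grid) <= EM_WORDS)
    for grid in [(864, 536, 15), (1024, 640, 12)]:
        print(grid, weather_words(*grid), weather_words(*grid) <= DBM_WORDS)
```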

Running time on these latter codes would be quite long. Doubling
the resolution roughly raises the running time by a factor of 8,
assuming the same number of levels, since the spatial resolution
and the time resolution are roughly proportionate. Hence, with a
grid size of 432 x 268 x 9, which is triple the resolution of the
analyzed case, 27 times the running time of the 89 x 144 x 7 grid
size analyzed is expected. At this triple resolution, a fourteen
day run, with a 7 minute time step, would take roughly two hours,
based on multiplying 4 minutes, 25 seconds by twenty seven.
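
The scaling rule used here (running time grows as the cube of the
resolution factor when the number of levels is held fixed) can be
written out as a short calculation. The function name is an
assumption for the sketch; the 4 minute, 25 second base time is the
figure quoted above.

```python
def scaled_run_seconds(base_seconds, resolution_factor):
    """Running time after refining both horizontal dimensions and the
    time step by resolution_factor, with the number of levels unchanged."""
    return base_seconds * resolution_factor ** 3


if __name__ == "__main__":
    base = 4 * 60 + 25                     # 265 seconds for the analyzed case
    tripled = scaled_run_seconds(base, 3)  # 27 times the base running time
    print(tripled)                         # 7155 seconds
    print(tripled / 3600)                  # 1.9875 hours, roughly two hours
```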


*A rough estimate of 0.36 Gflops/sec for the 20 x 24 size is
arrived at by the following approximations. AVRX throughput
(0.065 Gflops per second) and running times are assumed the same
for the small grid as for the analyzed case. For the rest of the
code, throughput is assumed the same, but the running time was
reduced by a factor of 26, since DOALLs drop from 26 cycles to one
cycle. The result is one twenty-sixth as much useful computation
done in 6.08% of the time.


CHAPTER 4

4.1 INTRODUCTION

The primary uses of the NASF are expected to be design and
modeling applications. These applications can be approached either by
experimentation  (such  as  with  wind  tunnels)  or  by  simulation. 
Figure  4.1  shows  the  relationship  of  these  two  approaches.  The 
NASF  is  expected  to  support  the  abstraction  of  the  "Real  World" 
with  some  mathematical  system.  Mathematical  conclusions  will  be 
established  as  a result  of  the  simulation  and  these  conclusions 
will  then  be  interpreted  to  determine  the  desired  physical 
conditions. 

The abstraction process represents the development of algorithms
to model real-world situations. The NASF should provide tools and
support to assist in this abstraction process. The system
considered in this Feasibility Study would provide support for the
abstraction process both with simple extensions to the well-known
FORTRAN language and with an interactive system which can be used
to observe the results of the use of the model.


Figure 4.1 Relationship of Simulation and Experimentation

The simulation process would also be supported with the language
extensions. The Support Processor and the File System would be
used with the FMP during the simulation process to provide the
same careful controls and monitoring needed during an
experimentation process. The results of simulations would be
observed through use of the various NASF user facilities (printers,
graphics terminals, COM, etc.) for interpretation by the users.
Where the results of experiments might be available on the
facility, comparisons between simulations and experiments would be
made.

These processes (abstraction, simulation, and interpretation)
require use of most of the components planned for the NASF. The
system-level components were already described in Chapter 2.
Chapter 5, which follows, discusses the Flow Model Processor (FMP)
in detail. This chapter concentrates on the system-level software
required to support these processes. The most direct software
support of users comes from some means of describing the
mathematical system which is the result of the abstraction process
and of controlling the simulation process. In the NASF, the
language used to define processes on the FMP provides the support
required. Other forms of software support are the Master Control
Program (the operating system which controls all parts of the
NASF), the File System Control Software, Intrinsics, and Test and
Diagnostic Support Software.

4.2 FMP FORTRAN

The language considered as a means for supporting the design and
modeling applications on the NASF is a dialect of FORTRAN. This
dialect is based on ANSI Standard X3.9-1978 [10] and includes a
few extensions which are appropriate both to implementation of the
models and their simulation.

The  description  of  the  FMP  FORTRAN  presented  here  is  substantially 
the  same  as  that  actually  used  during  the  application  analysis 
(see Chapter 3). The language constructs presented are
particularly oriented to describing a set (or collection of sets)
of discrete processes which may be used to define the desired models.
This  simple  set  of  constructs  seems  sufficient  to  support  the 
applications  planned  for  the  NASF. 

4.2.1  Language  Design  Considerations 

The  design  of  a language  must  be  concerned  not  only  with  the 
utility  of  its  use  for  applications,  but  it  must  also  consider 
problems  of  complexity,  of  implementation  on  the  hardware  of 
interest,  and  of  debugging  and  verification  capabilities. 

4.2.1.1 Complexity


As the capabilities of available hardware have expanded, the uses
of the hardware have expanded to the point where the software has
become extremely complex. The development of a new language to
support the NASF community therefore must consider the problems of
complexity. Since programs are more often read than written, the
language source becomes important in two forms of communication,
one with other users and the other with the NASF. The most
important concern then is to try to achieve a match between the
apparent complexity in a program and our human ability to deal with
that complexity. The language constructs described in the
following sections have been chosen to highlight the essential
major ideas of a model while using "standard" FORTRAN to define the
details. Some of the complexity of the "standard" FORTRAN will be
removed by optional automatic formatting of the source listings.
For example, the section of SMOOTH shown in Figure 4.2 shows how
indentation can be used to clarify the scope of the various
control structures such as DO and IF.

Although the design of a language cannot provide the desired
simplicity automatically, the constructs can be chosen to
influence programming style in the desired way. Therefore, the
constructs chosen in the language extension are few and general in
nature. The programs written using these constructs should be
easily understood, hopefully even more understandable than the
same algorithm expressed in serial constructs. This result is
expected because each part of a program can be kept conceptually
simple and because the relations between the parts of the program
are kept simple. The constructs chosen also make the
representation of discrete models more natural. These constructs
should allow simplification of the abstraction process without
losing the ability to make efficient use of machine organization.
In addition, subsequent modifications to the abstractions should
be simpler.

4.2.1.2 Abstraction and Modeling

Before considering the proposed language constructs, the
abstraction and simulation processes of Figure 4.1 should be
discussed in more detail. The problems faced in the practical use
of the NASF will be how to abstract the real-world systems of
interest and how to control the simulation of such systems so that
the results would be a meaningful adjunct to experimental results.

Since a digital system cannot directly model a continuous process,
the abstractions must be to some discrete-system representation.
The first step in such an abstraction is to identify the structure
(and substructure) of the model. For example, the geometries or
grids of interest would be defined. Then the information of
interest throughout the model would be identified. Such
information is usually called "state" information since it
describes the current state (or value) at the point of the model
with which it is associated. For example, when studying air flow
around an object, wind velocity, wind direction, and pressure may
be of interest at each point of the model. At the same time, some
state information may apply to the entire model (such state
information is invariant over the model). For example, the
Reynolds number of the fluid flowing around an object might be
assumed to be the same everywhere. If such an assumption is made,
one value can be used to represent the entire model.

Figure 4.2 Example of Source Code Formatting

Real systems of interest are usually not static. They show some
"behavior" over time. Such behavior is observed through the state
information. Thus some means must be provided to describe the
process by which the state at a particular point in the model
changes over time. Conceptually, such a process exists at each
point in the model. Note that these processes are concurrent.

The language constructs chosen below provide means of describing
both the spatial relationships (geometry and state) and the
temporal relationships (processes) in a model. In general,
standard FORTRAN constructs are used to define the process of state
change at each point in the model while a new construct (called
DOALL) is used to identify the natural points of concurrency.
Although the normal FORTRAN variable and array mechanisms are
available to describe the state of the model, two additional
constructs are defined which are intended to make the abstraction
of models more straightforward as far as geometry and state
variables are concerned and which assist in efficient usage of the
storage of the FMP.

4.2.2 Language Constructs

The language called FMP FORTRAN is based on ANSI FORTRAN 77
(X3.9-1978) [10] with extensions and modifications to improve its
utility for use for the planned applications and to allow
efficient use of the projected hardware. FMP FORTRAN is expected
to implement all of the features of ANSI FORTRAN 77 except that
CHARACTER type, all usage of CHARACTER type, and Input/Output
Statements are as defined in the subset FORTRAN in the ANSI
document [10].

The additional constructs described in the following sections have
been motivated by the abstraction and simulation functions already
described. The three major areas discussed are geometry, state of
the model, and process modeling. As with other parts of the
system approach discussed in this report, areas for continued
improvement certainly exist. However, the language constructs
reported here were sufficient as far as the specific application
programs considered are concerned.

4.2.2.1 Introductory Example

To introduce the basic concepts of the language extensions, a
simple example will be considered first. The sections which
follow will explore each of the major areas in more depth.

Figure 4.3 shows the main computation section of TURBDA (from the
explicit aerodynamic flow code). Note that there are three nested
loops. Also note that the computation inside the loops is
independent of the nesting order. The variable CV1 is used each
time the inner loop is evaluated without change.

Now consider Figure 4.4, which is a corresponding FMP FORTRAN
version of the code in Figure 4.3. Note that the nested DO loops
are replaced with a statement called "DOALL" at the start of the
loop and with a statement called "ENDDO" at the end of the loops.
This version would execute exactly the same as the original
version if only one processor is available. However, the DOALL
construct gives the specific information that the computation of
the inner loop for each combination of I, J, and K values is
independent of all other I, J, K combinations. In other words, if
enough processors were available, all IL*JL*KL instances of the
code in the inner loop could be computed concurrently. In this
case, there would be IL*JL*KL copies of the inner loop code (one
copy per processor). Execution of the DOALL statement would
activate all IL*JL*KL instances simultaneously (one per processor),
each with its own set of I, J, K values. After all instances had
completed, execution would continue after the ENDDO statement,
just as control passes the CONTINUE statement in the original code
when all loops are complete.

From an application standpoint, the DOALL statement identifies a
grid over I, J, and K. The arrays EI and RMUL have one element
(of "state" information) corresponding to each point of the grid.
The variable CV1 is a "global" state variable. The code between
the DOALL and ENDDO statements describes the process of changing
state variables from the old set of values to a new set of values.
Note that the process is logically different along the J=1 and K=1
planes. One evaluation of the code for a specific combination of
I, J, K values is called an instance.
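The independence of instances can be illustrated outside FMP FORTRAN. The following Python sketch is not part of the FMP design: it treats the loop body of Figure 4.3 as a function of one (I, J, K) point and evaluates the grid in two different orders. The function names, the dictionary representation of EI, and the constants (taken from a reading of the scanned figure) are assumptions made for illustration only.

```python
import itertools
import math
import random


def instance(ei, cv1, i, j, k):
    """Evaluate the loop body once for a single 1-based (I, J, K) point."""
    temp = abs(ei[(i, j, k)]) * cv1
    if k == 1:  # the process differs on the K=1 plane
        temp = 0.5 * abs(ei[(i, j, 1)] + ei[(i, j, 2)]) * cv1
    if j == 1:  # and on the J=1 plane
        temp = 0.5 * abs(ei[(i, 1, k)] + ei[(i, 2, k)]) * cv1
    # Constants as read from the scanned listing (an assumption).
    return 2.270e-8 * math.sqrt(temp) ** 3 / (temp + 198.6)


def doall(ei, cv1, il, jl, kl, shuffle_with=None):
    """Evaluate every instance; the visiting order is deliberately arbitrary."""
    points = list(itertools.product(range(1, il + 1),
                                    range(1, jl + 1),
                                    range(1, kl + 1)))
    if shuffle_with is not None:
        shuffle_with.shuffle(points)
    # Each instance reads only old state (EI, CV1) and writes its own RMUL.
    return {p: instance(ei, cv1, *p) for p in points}
```

Because no instance reads a value written by another instance, the result is the same whatever order (or degree of concurrency) is used to visit the points.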

The compiler is informed of the usage of variables in this case
with the last part of the DOALL and ENDDO statements. The
variables and arrays listed after the word USING in the DOALL
statement and after the word GIVING in the ENDDO statement identify
the state variables (usually in Extended Memory).

Before considering the details of each of the constructs, consider
how each of the instances executes and uses memory. In the case of
IL*JL*KL processors, all processors begin execution, each on its
own instance. Each processor has a copy of the code and executes
out of its own local storage. CV1 and the array EI(I,J,K) would be
referenced in the common Extended Memory. The variable TEMP is
completely local to an instance. Therefore, each processor would
have a storage location for TEMP as used in the instance executing
on that processor. The resulting array RMUL(I,J,K) would also
be in Extended Memory. Since all IL*JL*KL instances need the
value of CV1, space would be allocated in each processor for CV1
and the value would be broadcast to all processors. This approach
costs a little storage and would save IL*JL*KL - 1 references to
the slower Extended Memory.




      CV1 = 1./CV
      DO 1 K=1,KL
      DO 1 J=1,JL
      DO 1 I=1,IL
      TEMP=ABS(EI(I,J,K))*CV1
      IF(K.EQ.1) TEMP=.5*ABS(EI(I,J,1)+EI(I,J,2))*CV1
      IF(J.EQ.1) TEMP=.5*ABS(EI(I,1,K)+EI(I,2,K))*CV1
      RMUL(I,J,K)=2.270E-08*(SQRT(TEMP)**3)/(TEMP+198.6)
    1 CONTINUE


Figure 4.3 Section of TURBDA
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Figure  4.4  FMP  FORTRAN  Version  of  Section  of  TURBDA 


4.2.2.2 Geometry

When planning the discrete representation of a real-world system,
three major steps are usually involved. First, the geometry or
general structure of the model is defined together with any
substructure expected to be used during definition of the structural
and temporal relationships of the model. In general, all of the
discrete points or elements of a model will be incorporated into a
set. Algorithms used to map this form of the model onto the
hardware will be described in Section 4.2.2.2.7.

The  examples  of  the  proposed  language  constructs  which  follow  are 
based  on  physical  models  and  corresponding  cartesian  coordinate 
systems.  These  concepts  apply  as  well  to  transform  spaces. 

The geometry of the model is defined by first describing a
substructure of single dimension. This substructure is then used to
describe structures of higher dimension. Since the points of the
discrete model are usually identified with an ordered set of
integers, the construct used, called DOMAIN, is capable of
building ordered sets. For example,

    DOMAIN /X/ : L=1, MAXX

Here the name of the domain is "X". The domain is an ordered set
of values as defined with the implied-DO form. If MAXX=5, then
/X/ = {1,2,3,4,5}. Two such linear domains can be used to define
a two-dimensional domain. For example:

    DOMAIN /LAT/ : I=1, IMAX
    DOMAIN /LON/ : J=1, JMAX
    DOMAIN /LAYER/ : /LAT/ .X. /LON/

Here  the  domain  LAYER  was  defined  to  be  the  Cartesian  product  of 
the  two  linear  domains  LAT  and  LON.  The  result  is  that  LAYER 
consists  of  all  (I,J)  pairs. 

Two forms for describing geometries of interest will be described
below. The first, called DOMAIN, allows the user to define the
overall structure or framework of the model. As in standard
FORTRAN, this form establishes the maximum structure of interest
in the problem, and is used in the mapping to hardware to properly
allocate storage and processors. The second form, called REGION,
is used to dynamically specify an arbitrary set of elements from a
domain.


4.2.2.2.1 DOMAIN Declarations. The domain declaration can take
either of two forms: direct specification or construction.

For example:

    DOMAIN /JK/: J=1, 10; K=1, 15

Note that if more than one domain-variable-set is included, the
resulting domain is assumed to be the Cartesian product of the
individual linear sets defined by each domain-variable-set. In
the example above, the domain JK consists of all (J, K) pairs
where J is from the set 1,2,3,...10 and K from the set
1,2,3,...15.

    DOMAIN /J/: JJ = 1, 10
    DOMAIN /K/: KK = 1, 15
    DOMAIN /JK/: /J/ .X. /K/

The last domain, JK, is defined as the explicit Cartesian product
(cross-product) of the sets defined in domains J and K.

The syntax charts below use the same conventions as in the FORTRAN
77 ANSI standard document [10].

The  syntax  of  the  DOMAIN  declaration  statement  is  as  follows: 


domain-statement:

    DOMAIN / domain_name / : domain_spec_parameters
                           | domain_construct_expression


A domain-name  follows  the  normal  FORTRAN  rules  for 
variable  naming,  and  may  not  be  the  same  as  the  name  of 
any  common  block. 


domain_variable_set:

    domain_variable = integer_expression , integer_expression
                      [ , integer_expression ]

A domain-variable is an integer variable.

The domain-variable is used only for notational convenience
during the definition of the domain. The domain-variable does
not remain "attached" to the domain during execution. Other
means for referencing elements of a domain are described below
(instance-identifier-lists and instance-variables).

The  integer  expressions  are  evaluated  at  the  point  in 
the  program  where  the  domain  is  declared.  The  domain- 
variable-set  establishes  a sequence  of  values  exactly  as 
for  a DO-loop.  If  the  last  integer  expression  is  omitted 
it  is  taken  equal  to  1 by  default. 
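Since the domain-variable-set is defined to behave exactly like a DO-loop, the set of values it produces follows the FORTRAN 77 iteration-count rule. The Python sketch below illustrates that rule; the function name `domain_variable_set` and its argument names are hypothetical and chosen only for this example.

```python
def domain_variable_set(first, last, step=1):
    """Return the values a FORTRAN 77 DO-loop 'v = first, last, step' visits."""
    if step == 0:
        raise ValueError("the increment of a DO-loop may not be zero")
    # FORTRAN 77 iteration count: MAX(INT((last - first + step) / step), 0).
    # Floor division agrees with INT whenever the quotient is non-negative,
    # and negative quotients are clamped to zero by max().
    trips = max((last - first + step) // step, 0)
    return [first + n * step for n in range(trips)]
```

For example, `domain_variable_set(1, 5, 2)` visits 1, 3 and 5, and omitting the third argument gives a step of 1.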


A domain-construct-operator is a set operator used to construct a
new set from two previously defined sets. The defined
domain-construct-operators are as follows:

.U. (union). The resulting domain includes all the elements
of both domains. All elements in the resulting domain are
unique (duplicates are deleted). The dimensionality of the
resulting domain will be that of the operands (which must
match).

.I. (intersection). The resulting domain includes only
elements that are present in both domains. Dimensionality of the
operands must match.

.X. (product). Each element of the resulting domain
corresponds to a pair of elements (one from each of the operands).
The dimensionality of the resulting domain equals the sum of
the dimensionality of the two operand domains.

.N. (relative complement). The resulting domain is the same
as that of the first (left-hand) operand with any elements
which occur in the second (right-hand) operand removed. The
dimensionality of the operands must match.

The precedence order for the domain-construct-operators is
.X., .N., .I., .U. Evaluation is from left to right.
Parenthesized expressions are allowed.

The size of a domain may be variable even if the domain is not
a dummy argument to a procedure. The domain is defined at run
time on entry to the procedure. Each procedure invocation may
cause a different-sized domain to be defined.

The variables defining the extents of the domain in the
domain-variable-set may be changed during execution of the procedure.
Such change does not have any effect on the size or shape of
the domain. Once the size and shape are determined on entry
they are fixed for the duration of the procedure.

Dimensionality is the number of domain-variables
needed to define the domain.
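The four domain-construct-operators can be modelled with ordinary ordered lists of tuples. The Python sketch below is an illustration only, not the compiler's representation. The helper names are hypothetical, and the element order produced by `product` (first domain-variable varying fastest, as in FORTRAN array storage order) and by `union` are assumptions; the membership of each resulting set does not depend on them.

```python
def linear(first, last, step=1):
    """One-dimensional domain; elements are 1-tuples (positive step only)."""
    return [(v,) for v in range(first, last + 1, step)]


def _check_same_dimensionality(a, b):
    if a and b and len(a[0]) != len(b[0]):
        raise ValueError("operand dimensionality must match")


def union(a, b):          # .U.
    _check_same_dimensionality(a, b)
    return a + [e for e in b if e not in a]  # duplicates are deleted


def intersection(a, b):   # .I.
    _check_same_dimensionality(a, b)
    return [e for e in a if e in b]


def complement(a, b):     # .N. (relative complement: a without b)
    _check_same_dimensionality(a, b)
    return [e for e in a if e not in b]


def product(a, b):        # .X. (dimensionalities add)
    return [x + y for y in b for x in a]  # left operand varies fastest
```

With left-to-right evaluation, an expression such as `/LAT/ .N. /NORTH/ .N. /SOUTH/` corresponds to `complement(complement(lat, north), south)`.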

4.2.2.2.2 Examples. The following are some examples of legal
DOMAIN declarations together with the actual DOMAIN defined.

    DOMAIN /LONG/ : I = 1, 4
        {1, 2, 3, 4}

    DOMAIN /LAT/ : J = 1, 5
        {1, 2, 3, 4, 5}

    DOMAIN /ODDLAT/ : J = 1, 5, 2
        {1, 3, 5}

    DOMAIN /NORTH/ : J = 5, 5
        {5}

    DOMAIN /SOUTH/ : J = 1, 1
        {1}

    DOMAIN /MIDLAT/ : /LAT/ .N. /NORTH/ .N. /SOUTH/
        {2, 3, 4}

    DOMAIN /LAYER/ : /LAT/ .X. /LONG/
        {(1,1) (2,1) (3,1) (4,1) (5,1) (1,2) ... (4,4) (5,4)}

    DOMAIN /LEVEL/ : K = 1, 2
        {1, 2}

    DOMAIN /ATMOS/ : /LAT/ .X. /LONG/ .X. /LEVEL/
        {(1,1,1) (1,1,2) (1,2,1) (1,2,2) ... (5,4,2)}

An alternate form of the above is

    DOMAIN /ATMOS/ : /LAT/ .X. I=1,4 .X. /LEVEL/

4.2.2.2.3 Restrictions. The prototype compiler should have a
domain dimensionality restriction to four domain-variables. This
restriction would help limit the problem of mapping to hardware
resources to an acceptable level of complexity. This restriction
would likely be lifted with later releases of the compiler.

4.2.2.2.4 Scope. The scope of a domain-name and the
corresponding set of points is a program unit. The scope of a
domain-variable is the domain-declaration statement. When the
same domain must be used in several program units, it must be
declared within each of them (like a named common block).

4.2.2.2.5 Required Order. The position of a domain-declaration
statement within a program is the same as "Other Specification
Statements" (see Figure 4.5).

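+---------+----------------------------------------------------+
|         | PROGRAM, FUNCTION, SUBROUTINE, or                  |
|         | BLOCK DATA Statement                               |
|         +------------+------------+--------------------------+
|         |            | PARAMETER  | IMPLICIT Statements      |
|         |            |            +--------------------------+
| Comment | FORMAT     | Statements | Other Specification      |
| Lines   | and        |            | Statements               |
|         |            +------------+--------------------------+
|         | ENTRY      | DATA       | Statement Function       |
|         | Statements | Statements | Statements               |
|         |            |            +--------------------------+
|         |            |            | Executable Statements    |
+---------+------------+------------+--------------------------+
| END Statement                                                |
+--------------------------------------------------------------+

Figure 4.5 Required Order of Statements and Comment Lines
in a Program Unit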


4.2.2.2.6 Application and Usage. The DOMAIN specification will
be used to define all those discrete points of the model at which
state information and/or processing will exist. Each discrete
point of the structure or substructure of the model will be
represented by an element in some domain.

4.2.2.2.7 Mapping. The geometry as specified must be mapped onto
the available hardware as a mapping under which the actual
modeling and simulation will run. Many possible mappings exist
with many tradeoffs to consider.

A simple static mapping was proposed and used during the
application analysis. Such a static mapping is easiest to implement
and results in the least compiler complexity. With such a mapping
the least run-time overhead is devoted to mapping and concomitant
data rearrangements. With this mapping, each element of a domain
is assigned to the processor whose number equals the element's
position in the linear representation of the domain, modulo the
maximum number of active processors. The linear representation
of a domain is the same as the storage order of an array with the
same subscript values and subscript variable order as the
domain-variables. For example, for an 8-processor system, domain
LAYER would be allocated as shown in Figure 4.6. The compiler code
should be sufficiently modular that other mappings can be easily
evaluated.

The static mapping was used during the application analysis
summarized in Chapter 3. The application analysis results show
that such a static mapping will support the applications studied.
Thus, the FMP is feasible to implement, even without possibly more
elegant mapping techniques.


    DOMAIN /LAYER/ : J=1,5 .X. I=1,4

PROCESSOR    0      1      2      3      4      5      6      7

           (1,1)  (2,1)  (3,1)  (4,1)  (5,1)  (1,2)  (2,2)  (3,2)
           (4,2)  (5,2)  (1,3)  (2,3)  (3,3)  (4,3)  (5,3)  (1,4)
           (2,4)  (3,4)  (4,4)  (5,4)

Figure 4.6 Modulo Mapping of Elements of a DOMAIN to Processors
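The modulo mapping of Figure 4.6 can be expressed in a few lines. The Python sketch below is a minimal illustration, assuming 1-based domain elements and processors numbered from 0; the function names are hypothetical.

```python
def linear_index(element, extents):
    """0-based position of a 1-based element in FORTRAN storage order
    (first subscript varies fastest)."""
    index, stride = 0, 1
    for subscript, extent in zip(element, extents):
        index += (subscript - 1) * stride
        stride *= extent
    return index


def modulo_processor(element, extents, n_processors):
    """Processor number assigned to a domain element by the static mapping."""
    return linear_index(element, extents) % n_processors
```

For the 5 x 4 domain LAYER on 8 processors, `modulo_processor((4, 2), (5, 4), 8)` is 0, which places element (4,2) directly below (1,1) in the figure.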


Many other possible mappings exist. Another simple static mapping
is where the first "n" elements would be assigned to the first
processor, the next "n" elements to the next, and so on. Here "n"
is defined to be the next integer equal to or larger than the
total number of elements divided by the number of processors. In
the above example, domain elements (1,1), (2,1), (3,1) would be
assigned to processor 0; (4,1), (5,1), (1,2) would be assigned to
processor 1, etc.
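The block mapping just described can be sketched in the same way. This Python fragment is illustrative only; `block_mapping` is a hypothetical name, and the element list is assumed to be given in the linear (storage) order of the domain.

```python
def block_mapping(elements, n_processors):
    """Assign consecutive runs of n elements to processors 0, 1, 2, ..."""
    # n is the smallest integer >= len(elements) / n_processors.
    n = -(-len(elements) // n_processors)
    assignment = {p: [] for p in range(n_processors)}
    for position, element in enumerate(elements):
        assignment[position // n].append(element)
    return assignment


# Domain LAYER (J=1,5 .X. I=1,4) in storage order, first subscript fastest.
LAYER = [(j, i) for i in range(1, 5) for j in range(1, 6)]
```

For the 20 elements of LAYER on 8 processors, n is 3, so processors 0 through 5 receive three elements each, processor 6 receives the last two, and processor 7 receives none. This uneven tail is one of the tradeoffs between the block and modulo schemes.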

Another consideration could be the locality of reference. In this
case elements could be mapped so that the processes associated
with all elements assigned to a processor tend to reference data
already physically in that processor, thus reducing traffic to the
extended memory. Dynamic allocation strategies could also be
considered. However, dynamic allocation must balance the benefits of
a possibly more uniform use of all processors with the likelihood
of increased movement of data to and from the common memory. (In
a static mapping, variables which are referenced only by the
instances within a processor could be assigned storage space
within that processor.) Further study is needed to determine the
most cost-effective strategy.

4.2.2.2.8 REGION Statement. Facilities to identify a REGION of
active interest within specified DOMAINs are provided. These
REGIONs do not constitute a separate structure. Essentially, a
REGION is a virtual domain with dynamically selected elements of
the original DOMAIN. The elements may be sparse or dense,
rectangular or skewed sections of domains. The REGION declaration
may be used to explicitly create a virtual domain with
dimensionality greater than its original domain or to define which
portion of a DOMAIN is to be "processed". The specification of a
REGION may be dynamic. The general form of the REGION declaration is:

For example

    REGION /JKPART (J=1,5; K=1,9)/ = /JK(J+5, K+2)/

The values of J and K specified for the region named JKPART are
substituted in the expressions associated with domain JK to
determine the correspondences. Specifically:

    JKPART (1,1) is "equivalent" to JK (6,3)
    JKPART (2,1) is "equivalent" to JK (7,3)
    JKPART (2,9) is "equivalent" to JK (7,11)

The domain_construct_expression is the same as the construct defined
earlier except that the / domain_name / part may not be used. The
domain_variable names are not variables to which values are
assigned. Rather, they are dummy names used to define the mapping
which identifies that part of the domain which is the region of
interest.
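The JKPART example can be read as a small index-translation function from region subscripts to domain subscripts. The Python sketch below illustrates that reading; `make_region` and its arguments are hypothetical names, and bounds checking against the region extents is an assumption about how an out-of-range reference would be treated.

```python
def make_region(bounds, to_domain):
    """Build a translator from region subscripts to domain subscripts.

    bounds    -- one (first, last) pair per region dimension
    to_domain -- function applying the integer expressions of the declaration
    """
    def translate(*subscripts):
        for value, (first, last) in zip(subscripts, bounds):
            if not first <= value <= last:
                raise IndexError("subscript outside the declared region")
        return to_domain(*subscripts)
    return translate


# REGION /JKPART (J=1,5; K=1,9)/ = /JK(J+5, K+2)/
jkpart = make_region([(1, 5), (1, 9)], lambda j, k: (j + 5, k + 2))
```

Here `jkpart(1, 1)` yields (6, 3) and `jkpart(2, 9)` yields (7, 11), matching the correspondences listed for JKPART. No data are copied; the region is only a different way of addressing elements of JK.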

Each of the integer_expressions may be a linear combination of the
domain variables used as part of the REGION declaration.
References to intrinsic functions will be truncated to integer if
necessary.

If the REGION is to choose a one-to-one mapping to DOMAIN elements
in the same order as in the DOMAIN, the identifier may be used
instead of an integer expression for that domain dimension.

4.2.2.2.9 REGION Declaration Examples. Assume that the following
DOMAIN declarations exist:

    DOMAIN /LAT/ : I = 1, 20
    DOMAIN /LON/ : J = 1, 30
    DOMAIN /LAYER/ : /LAT/ .X. /LON/

The following regions correspond to the drawing in Figure 4.7.

    REGION /LAYER1 (I=1,5; J=1,10)/ = /LAYER(I, J+10)/
    REGION /LAYER2 (I=1,10; J=1,10)/ = /LAYER(I+5, J)/
    REGION /LAYER3 (I=1,10; J=1,10)/ = /LAYER(MOD(I+15, IMAX), J+20)/

Note in LAYER3 that LAYER3 (1,1)  = LAYER (16,1)
                    LAYER3 (5,1)  = LAYER (20,1)
                    LAYER3 (6,1)  = LAYER (1,1)
                    LAYER3 (10,1) = LAYER (5,1)

4.2.2.3 Model State

After defining the geometry or structure of a model, the state of
the model at each point of interest in the defined structures and
substructures needs to be described. The state can be described
through the use of the various variable types and declarations
available in FORTRAN. Since with the applications of interest,
the state variables are the same at all points throughout the
structure, a new construct, called INALL, was defined to help
simplify the description of the state throughout a structure. For
example, the declaration:

    REAL INALL /LAYER/ WNDVEL, WNDDIR, T, P

defines the variables called WNDVEL, WNDDIR, T, and P. Unlike
standard FORTRAN, this declaration specifies that each variable
occurs "INALL" of the discrete points defined by the domain called
LAYER. In other words, a different variable WNDVEL exists at each
point or element of the domain LAYER. Recall that LAYER was
defined in the example above as

    DOMAIN /LAYER/: I=1,20; J=1,30

Variables defined with the INALL statement can be used in FORTRAN
the same way as dimensioned variables, as described later. In such
a use, the subscripts identify the point (element) of interest in
the structure.

The result of the INALL statement is a set of wind velocity, wind
direction, temperature and pressure variables at each point of the
domain. The storage reserved would be the same amount as a
dimension statement of the form

    DIMENSION WNDVEL(20,30), WNDDIR(20,30), T(20,30), P(20,30)

However, unlike variables declared with standard dimension
statements, the "inall-variables" can be considered to be simple,
unsubscripted variables when defining the process to be simulated
at each point in the model, as described later. When the names in
the INALL declaration have dimensionality, the implied subscript
positions of the domain variables precede the subscript positions
whose dimensionality is explicit.

4.2.2.3.1 INALL Declarations. The INALL declaration would take
the form:

If a type is declared, it applies to each of the inall_variables
listed. The inall_variable can be a variable name, an array name
or array declarator. If no type is specified, each inall_variable
on the list will be implicit type or as specified in a separate
type statement.

The INALL declaration serves the dual function of declaring the
type of the variables on the list and of declaring storage
requirements. The INALL declaration semantically indicates that each
element of the domain defined has associated with it variables
identified in the list. Thus, if there are 10 variables on the
list and if the domain declared has 3 dimensions with extents of
3, 4, and 5, then the storage space reserved would be 3*4*5*10 or
600 storage units.
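The storage arithmetic above generalizes directly to inall-variables that are themselves arrays. The following Python sketch is an illustration under that reading; the function names are hypothetical, each scalar is assumed to occupy one storage unit, and an empty shape `()` denotes a scalar inall-variable.

```python
from math import prod


def inall_storage(domain_extents, variable_shapes):
    """Storage units reserved by one INALL declaration."""
    points = prod(domain_extents)
    per_point = sum(prod(shape) for shape in variable_shapes)  # prod(()) == 1
    return points * per_point


def full_shape(domain_extents, shape):
    """Implied domain subscripts precede the explicitly declared ones."""
    return tuple(domain_extents) + tuple(shape)
```

Ten scalar variables over a 3 x 4 x 5 domain give `inall_storage((3, 4, 5), [()] * 10)`, which is 600, in agreement with the text. A five-element vector declared INALL over the same domain would be addressed with four subscripts, the last one being its own.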

4.2.2.3.2 Scope. The scope of the INALL-variable name is a
program unit. If an INALL variable is in a named common block,
all names in that named common block must match in the several
program units where they occur.

4.2.2.3.3 Application and Usage. Each element of a domain will
include the set of declared inall-variables. That is, a unique
set of inall-variables will exist for each point in the domain.
The language constructs used for referencing these variables are
described in Section 4.2.2.6.

4.2.2.3.4 Mapping. The physical storage allocated for the
inall-variable set corresponding to each point of the specified domain
would be allocated to the storage of a physical processor in the
same manner that the domains are mapped (see Section 4.2.2.2.7).
As a result each processor will contain as many inall-variable
sets as the number of elements assigned to that processor for each
domain. Figure 4.8 is an example of this allocation.

The purpose of the DOMAIN and INALL statements is to define an
application-oriented data structure. The structure is
hierarchical. The major divisions (the grid) are defined by the
DOMAIN statement. The subdivisions are defined by the INALL
statement and consist of the state variables and arrays defined to
exist at each grid point. The data structure definition is
independent of how the structure is mapped onto the storage of the
FMP.

4.2.2.4 Process Modeling

Once the structure and state of the model have been defined, some
means of describing the process to be modelled must be provided.
This description is done in two stages. First, the general flow
of the sequence of events which occur during the process would be
described. This general flow description allows the dependence of
subprocesses over time to be defined. Second, the detailed
relationships that exist within the defined structures and
substructures are defined. These relationships exist for each of
the events. The dynamic "behavior" of the discrete system is
defined by the combination of the general flow and the detailed
relationships.

Standard FORTRAN is a language with inherent sequential
dependencies. That is, execution is constrained to be one statement at
a time in a well-defined order. The general flow does have such a
sequential dependency. Standard FORTRAN control constructs can,
therefore, be used to describe the sequential dependency.
However, when modeling real processes, there are also concurrent
actions which must be considered. If the language provides a
means for describing the concurrency naturally inherent in the
processes being modeled, then the mapping of the abstracted process
to the hardware will be more straightforward and the user should
find it easier to define the abstraction. This concurrency in the
general flow can be described with the construct called DOALL.

4.2.2.5 DOALL Construct

The basic form of the DOALL construct was shown in an earlier
example (Section 4.2.2.1). Recall that a segment of standard
FORTRAN code (which describes the computation required to evaluate
the process of getting new values from old values) is started with
a DOALL statement and ended with an ENDDO statement. Figure 4.9
is another example. This is a section of the FMP FORTRAN version
of SMOOTH (see Appendix A for a discussion of the application
code). Note the region THREED. This region is three-dimensional.
Variables SS, T1, T2, T3, T4 and a vector CT(5) are defined at
each point in the domain MODEL. The DOALL statement is the
control statement that indicates that all statements from that
point to the ENDDO statement are considered to be replicated
(one set of statements to each point in the domain), that each set
of statements (called an instance) is independent from all other
sets, and that all sets of statements could execute concurrently
(given sufficient resources). The IF statement tests to
determine whether the particular domain point is in either the J=2
or J=JMAX-1 plane. If so, then the next few statements are
executed. If not, the statements following the ELSE are executed.
Note that these sequencing decisions within an instance are based
on data unique to the instance and do not depend on any other
instances. Also note that the statement following the ENDDO
is not permitted to begin until all instances have completed
execution.

It is interesting to note that if there are fewer processors than
instances to be evaluated, then the work would be spread out
across the processors. Each processor would evaluate more than
one instance. Since only one set of code would be required per
processor, multiple instances are executed simply by cycling
through the code for that segment. The term cycles is used to
indicate how many instances a processor has evaluated of the
currently executing doall construct. The allocation of specific
instances to processors would be static and would use the scheme
previously discussed with regard to assignment of elements of a
domain to processors (which is the same problem).
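The number of cycles each processor performs follows from the static allocation. The short Python sketch below assumes the modulo mapping of Section 4.2.2.2.7 is the scheme in use; the function name is hypothetical.

```python
def cycles_per_processor(n_instances, n_processors):
    """Instances evaluated by each processor under the modulo mapping."""
    base, extra = divmod(n_instances, n_processors)
    # Instance i goes to processor i mod P, so the first 'extra'
    # processors each evaluate one more instance than the rest.
    return [base + 1 if p < extra else base for p in range(n_processors)]
```

For 20 instances on 8 processors this gives 3 cycles on processors 0 through 3 and 2 cycles on processors 4 through 7. The elapsed time of the DOALL is therefore governed by the processors that perform the larger number of cycles.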
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SUBROUTINE  SMOOTH 

CDMMON/BftSE/NMRX,  JMrtX,  KMflX,  LMRX,  DT  , SAMHH,  i3AMI  , rSMftCH, 

* DXijDYi,  DZi,FV<.5  ■/ , rO(5)  , hD,  fiLF  , 6D,  OME  Gft  • HDX  , HD  V , hDZ  , KM, 

3 CNBR, PI , ITR, HP,  INTI,  XNT3,  INT3 
DOMRIN  /MODEL/;  J::l,xU0;  K&:1,50;  L::j.,300 

REGION  /THREEDi  <Js<i,JMHX-x;,  vKs£,KMRX-x;,  <L:i3,UMRX-x;  )/ 

X ;=  /MDOEL<  J,K,  L^/ 

ZNflLU  .MODEL/  ^ 6 ^ , S v 5 > , SS  , CT  C ^ , Tl  , 7 3 , T 3 , T 4 

4TH  ORDER  SMOOTHING,  3D  ORDER  AT  THE  BOUNDARIES 
DDALL  /THREEDC J, K, L ;/  , USING  Q,  S,  SMU 

TEMP  — x»/(j!v.J,K,L,U^ 

DO  X N:ix,5 

!..iv,Nj>—  C(i%J,K,LfNJxQvU,K,L,6j/ 

CONTINUE 

IF  (j,ed,3  ,or,  j , eg  , jmax-1  j THEN  ‘ iilLriT  (^FTHl' 

7xut/,UTx,K,L,UJ  ■ I y\  ’ \ - \\  ^ ■*’  P(  K )R 

>3  “ Gv,J**x,K,L,oji 

DJ  3 NSX,S 

SS  - S<J,K,L,N^  T U , 5wSMUXvG(  Jtx,  K,  L,  mxTX>“  3 . •>  C T i N J t 

1 Gy|J  — x,K,L,N^xTL)xTEMP 

CONTINUE 
ELSE 

DO  3 NSX,5 

T1=G(Jt3,K,L,6> 

T3~Gv  U**3,  K«L,Uj> 

V3=GCJtx, K, L, U; 

T4  = GvJ**1,K,L,U; 

SS  ::  S<J,K,L,Ny  t SMUx^GvUt3,K,L,N;xT1  + G^J-u,f-,L,N^xT3  t 
X 4,XiGvJtX,k,u,N^xT3  t GvU"*-*‘,^^,^,N;;xT4ji  »•  0,xCTvNj;;ixTEMP 

CONTINUE 
E N D I F 

EHDDO/THREED/ 


Figure 4.9 Section of FMP FORTRAN Version of SMOOTH


4.2.2.5.1 Construct Definition. The specific form of the DOALL
construct is:


doall_construct:

    doall_statement  doall_program_segment  enddo_statement


The doall statement is defined as follows:

doall_statement:

    DOALL  { domain_identifier | domain_specifier }  use_list


The main purpose of the DOALL construct is to identify those
processes (in the doall_program_segment) which can be concurrently
evaluated. The doall_statement identifies the points (grid
values) at which the doall_program_segment would be evaluated.
The points may be previously defined (as a domain or region) or
may be defined as part of the doall statement.


domain_identifier:

    / { domain_name | region_name }
        ( instance_variable [ , instance_variable ] ... ) /


Each instance-variable is an integer variable and is distinct from
all others on the list. Each instance-variable represents one of
the dimensions of the declared domain. Instance-variables are not
required to be the same as the domain-variables used when the
domain was originally declared. The set of values assigned to the
instance-variables at the start of evaluation of an instance is
the set of values that uniquely identifies that instance. The
scope of the instance-variable is the doall-program-segment. The
instance-variables are allocated storage in the local processor
memory.

    domain_specifier:
        [ / domain_name / : ]  domain_construct_expression


The domain-name could be included on the ENDDO statement which
terminates the construct (to assist in program readability). When
a domain is declared as part of the DOALL statement itself, its
scope is local to the DOALL-program-block itself. The
domain-variables used in the domain-specification-parameters become
instance-variables as described above in addition to being used to
determine the extent of each of the dimensions of the domain.

The doall-program-segment is a set of FMP FORTRAN statements which
describe the process to be evaluated at each point specified. In
terms of the model, the process defined in the
doall-program-segment is conceptually evaluated at all points
simultaneously. In addition, the process at any one point does not
have access to the values computed by the same process at any other
point in the model. Figure 4.10 shows this independent, concurrent
structure in a "flow-chart" form. The evaluation of a
doall-program-segment at a given point is called an instance. All
instances of a doall-program-segment will complete execution before
the next statement in sequence after the ENDDO is executed.
Although conceptually all instances execute concurrently, the
actual order of execution is dependent on the processing resources
available. The only implementation requirement is that all
instances must complete execution prior to continuing with the next
statement after the ENDDO.

The use list specifically identifies which variables or arrays
are used within instances of the doall-program-segment. The
specific form is:


The purpose of this USING clause is explained in more detail later
(Section 4.2.2.6).

The  last  statement  of  a doall  construct  is  the  enddo  statement. 
    generate_list:
        GIVING  item { , item }

    item:
        variable_name | / common_block_id / | array_name | / domain_name /


The generate list specifically identifies those variables or
arrays produced for reference upon completion of the
doall_construct.


4.2.2.5.2 Serial FORTRAN Equivalent Form. Any DOALL construct
can be simply represented in standard FORTRAN with nested DO
loops. The depth of nesting matches the dimensionality of the
domain over which the DOALL is defined. In fact, the
domain-variable-sets (see Section 4.2.2.2.1) used to define the
domain become the control part of the DO statements.
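As a rough illustration of this equivalence, the sketch below uses
Python rather than FORTRAN and renders a two-dimensional DOALL as two
nested loops, one per domain dimension. The names serial_doall, old,
new, and left_neighbor_sum are illustrative assumptions, not part of
FMP FORTRAN. The separate result array is one simple way to honor the
rule that every instance sees only the values present on entry to the
construct.

```python
def serial_doall(old, body):
    """Evaluate body(old, i, j) at every point of a 2-D domain.

    One nested loop per domain dimension. Results are written to a
    fresh array so no instance can observe another instance's new value.
    """
    rows, cols = len(old), len(old[0])
    new = [row[:] for row in old]
    for i in range(rows):        # outer DO loop: first domain variable
        for j in range(cols):    # inner DO loop: second domain variable
            new[i][j] = body(old, i, j)
    return new


def left_neighbor_sum(old, i, j):
    """Example program segment: add the value to the left (wrapping around)."""
    return old[i][j] + old[i][j - 1]


grid = [[1, 2, 3], [4, 5, 6]]
result = serial_doall(grid, left_neighbor_sum)
```

Because the loops read only from old, swapping the two loops or
reversing either one leaves result unchanged.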

4.2.2.5.3 Nested DOALLs. The doall-program-segment may itself
include a DOALL construct. Since the application programs of
interest did not require this construct, no evaluation of possible
run-time efficiency or inefficiency has yet been made. Since
dynamic resource allocation has not been proposed yet, nested
DOALLs would be statically decomposed to serial form.

4.2.2.5.4 Mapping. The mapping of the doall-program-segments is
the same as that described for an element of a domain. However,
since each instance executes the same doall-program-segment, only
one copy of a program-segment need be kept by each processor.

4.2.2.5.5 Restrictions. No instance of a doall-program-segment
may reference the results of the current computation of any other
instance in the same doall-program-segment. Each
doall-program-segment has access to all of the values of the model
at the start of that program-segment. The entire DOALL construct
must be treated as a whole in order to control the implementation
and use of the construct. For example, consider a hypothetical
system where such a restriction did not exist and suppose that the
computations performed in one instance did use the value of
variables in another instance of the current
doall-program-segment. Under these conditions, successive runs of
the program are likely to get different results since the time
order of execution of the two program segments is not necessarily
the same from one run to the next. As a result, the variable values
fetched from the second instance are either old values or new
values, but with no control or "knowledge" of the encompassing
program that such a variation occurred. Programs would be very
difficult to debug in such a hypothetical system.

Since no referencing between instances of the currently executing
doall-program-segment is allowed, the results of evaluation are
completely independent of the order of execution.
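The effect of this restriction can be sketched in a few lines of
Python. This is an illustrative model only: run_instances and the
use_snapshot flag are hypothetical names, and the "segment" is the
simple update x(i) = x(i) + x(i-1). With the restriction, reads come
from the values on entry and any execution order gives the same
answer. Without it, the answer depends on the order.

```python
def run_instances(values, order, use_snapshot):
    """Run x(i) = x(i) + x(i-1) once per instance, in the given order."""
    entry_values = list(values)   # values on entry to the construct
    current = list(values)        # values being updated
    for i in order:
        read_from = entry_values if use_snapshot else current
        current[i] = read_from[i] + read_from[i - 1]
    return current


values = [1, 2, 3, 4]
forward = [0, 1, 2, 3]
backward = [3, 2, 1, 0]

# With the restriction: reads see only entry values.
restricted_fwd = run_instances(values, forward, use_snapshot=True)
restricted_bwd = run_instances(values, backward, use_snapshot=True)

# Without the restriction: reads may see another instance's new value.
unrestricted_fwd = run_instances(values, forward, use_snapshot=False)
unrestricted_bwd = run_instances(values, backward, use_snapshot=False)
```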

Because of the concurrency expressed in the DOALL construct, the
arbitrary transfers of control which are allowed in standard
FORTRAN must be restricted in FMP FORTRAN. No transfers into a
DOALL construct may be made. Within a DOALL construct, normal
FORTRAN control constructs (IF, GO TO, ...) are allowed, but
control must remain within the DOALL construct. All instances
exit via the ENDDO. No transfers out of an instance are allowed.

4.2.2.6 Variable Referencing

The standard FORTRAN referencing conventions apply. One extension
has been defined to simplify the description of the models of
interest. This extension supports the concept of
"centered-subscripts".

4.2.2.6.1 Referencing Within a DOALL-program-segment. The DOALL
construct described above is used to define the time sequencing of
a modeling process. At each discrete time step, some sort of
interaction between the various parts of a model takes place. In
particular, the modeling task may involve the use of general state
variables, of state variables unique to each element of a domain,
and of intermediate variables used during the evaluation of a
process. General state variables are used to describe an overall
process or structure and may be referenced within each instance.
The state variables defined at each point of a domain may be
referenced by processes defined at other points. Finally, the
intermediate values used during the evaluation of a process would
be of concern only to each instance and not to any other. In
order to have an orderly flow of data and allocation of storage in
the system, language constructs have been proposed which
relate to the above dependencies.

The general state variables (those variables which apply to all
points of a domain or region) will be called GLOBAL variables.
Those state variables which have been defined for each of the
points of a domain will be called STRUCTURE variables. Any
variables with values generated and used only within an instance
will be called LOCAL variables. Note that GLOBAL variables would
not be defined using INALL statements.

The differentiation of these different "classes of use" is
important in a multiprocessor such as the FMP because of the added
complexities of storage allocation and storage management. The
additional constructs already defined provide application-oriented
ways to define variable usage to the compiler.
The USING and GIVING clauses of the doall and enddo statements are
a natural way to explicitly define the data-dependency of the
system modeled at the process-level. The compiler, in turn, will
use these statements, together with analysis of the source, to
produce code to initiate early requests of data transfers from EM
to the processors and back, thus further improving throughput by
allowing more overlap of execution with fetching data from the
Extended Memory.

Any variable used within a doall-program-segment but not declared
in the USING clause must be self-defining within each instance
prior to its use. If a variable is not included in a USING or
GIVING clause the implication is that the variable is only needed
temporarily during the evaluation of the process. Thus, in order
to consider storage requirements, variables not declared either
USING or GIVING need exist only for each "active" instance rather
than for each instance. (An "active" instance is an instance
which is being executed by a processor resource.)

If a variable is included on a USING or GIVING clause, but is not
included in an INALL declaration, the implication is that all
instances of the doall-program-segment will reference the same
variable (GLOBAL variable). When this condition occurs, the
compiler would allocate space for such a variable in each processor
and generate code which would cause the value of such a variable
to be broadcast to all processors rather than requiring each
instance to separately request access to that variable. Variables
of this sort were previously called "CONTROL" [1,2].

If a variable is included on both a USING or GIVING clause and on
an INALL declaration, the implication is that each instance will
require its corresponding INALL variable (recall that INALL
creates a variable for each point in the domain). Special
subscript forms defined in the next section can be used to
reference INALL variables in other instances.

Figure 4.11 summarizes the variable use interpretations based on
the statements defined.

The importance of the independence of the instances of a
doall-program-segment has already been pointed out. All GLOBAL and
STRUCTURE variables as well as all instance identifiers (used to
identify the set of indices which define the point in the domain)
can be considered to be preinitialized to their value upon entry
to the DOALL construct. At that point, at least conceptually, the
evaluation rules within a particular instance are interpreted in
classical FORTRAN fashion except that the values assigned to
GLOBAL or STRUCTURE variables can be referenced only by the
instance which did the assignment. All other instances would
continue to reference the original values. Similarly, a set of
instance identifier variables would exist for each instance. The
initial values in the set for a particular instance would identify
the instance. Any changes made to the values in one set could not
be observed within any other instance. The FMP (hardware and
software) will enforce these referencing procedures.


                            in any USING or GIVING clause
                               YES            NO

    declared INALL    YES      STRUCTURE      LOCAL
    on DOMAIN
                      NO       GLOBAL         LOCAL

Figure 4.11  Variable Use Interpretation
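The interpretation in Figure 4.11 amounts to a two-input decision.
The Python sketch below restates it as a function. The name
classify_variable and its boolean parameters are illustrative
assumptions and are not part of the proposed compiler.

```python
def classify_variable(in_using_or_giving, declared_inall):
    """Return the class of use implied by Figure 4.11.

    in_using_or_giving: the variable appears in a USING or GIVING clause.
    declared_inall:     the variable is declared INALL on a DOMAIN.
    """
    if not in_using_or_giving:
        # Needed only temporarily inside an instance, INALL or not.
        return "LOCAL"
    return "STRUCTURE" if declared_inall else "GLOBAL"
```

For instance, a variable that is declared INALL and named in a USING
clause is classified as STRUCTURE, while one that appears in neither
place is LOCAL.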


4.2.2.6.2 Centered-Subscripts. The intent of the new constructs
described (DOMAIN, INALL, DOALL, ...) has been to allow the
description of a model and the modeling process as it reflects the
process and state at each discrete point of the structures of
interest. References by the doall-program-segment to variables in
the same element of the DOMAIN as the instance need only be by the
simple name. For example,

      REAL INALL /ATMOS/ T, WNDVEL, AB(7)

is a statement declaring variables T and WNDVEL and a vector AB at
each element of the domain ATMOS. In the following program
segment, the process defined computes new values which are a
function only of old values in the same instance:


      DOALL /ATMOS/ USING WNDVEL, AB
      T = (AB(1) + AB(2))/2
      WNDVEL = (WNDVEL + AB(3) + AB(5))/2*AB(4)
      ENDDO /ATMOS/ GIVING T, WNDVEL


Many models have dependencies between elements of the structure.
When describing processes of this type, a natural approach is to
describe the process centered at a particular element and consider
the rest of the structure with respect to the centered element.
When referencing INALL variables in other instances, a subscript is
used in a manner similar to normal array and vector referencing in
standard FORTRAN.
Another example might be:


      DOALL /ATMOS(I,J,K)/ USING T
      T = T(I,J,1)
      ENDDO /ATMOS/ GIVING T


Here the new value of T at each element of the structure is made
equal to the original value of T on the lower plane of the
structure (i.e., all elements of a column of /ATMOS/ have the
same value of T as the value of T in the first element of the
column). Note that I and J are constants throughout the
doall-program-segment since they are the instance-variables. The
"*" may be used to indicate the value of the instance-variable
corresponding to the element of the domain. For example, another
way of writing the preceding example is:


      DOALL /ATMOS/ USING T
      T = T(*,*,1)
      ENDDO /ATMOS/ GIVING T


When variables in adjacent elements of a domain are to be
referenced, subscript expressions may be used. For example:


      REGION /CENTRAL(L=1,IMAX-2; M=1,JMAX-2; N=1,KMAX-2)/
     1     = /ATMOS(I+1,J+1,K+1)/
      DOALL /CENTRAL(I,J,K)/ USING T
      T = (T + T(I+1,*,*) + T(I-1,*,*) + T(*,J+1,*) + T(*,J-1,*)
     1     + T(*,*,K+1) + T(*,*,K-1))/7
      ENDDO /CENTRAL/ GIVING T
In this example a REGION was declared that excluded the outer
boundary of ATMOS. The DOALL computed new values of T based on
all immediately adjacent values. Note that variables declared
INALL over a DOMAIN are also accessible to any REGION of the
DOMAIN just as if they had been declared INALL on the REGION.
Also note that values of T at adjacent elements of the DOMAIN are
used to compute new values of T at each element of the REGION.
All computation is based on the values of T throughout the DOMAIN
upon entry to the DOALL construct.
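A serial rendering of this seven-point average can be sketched in
Python as follows. It is an illustration under stated assumptions,
not the FMP implementation: the function name smooth_interior is
hypothetical, T is held as a nested list indexed from zero, and the
REGION is modeled by looping over interior indices only. All reads use
the values of T on entry, as the text requires.

```python
def smooth_interior(t_old):
    """Average each interior point with its six neighbors (entry values only)."""
    ni, nj, nk = len(t_old), len(t_old[0]), len(t_old[0][0])
    # Start from a copy so boundary points keep their original values.
    t_new = [[column[:] for column in plane] for plane in t_old]
    for i in range(1, ni - 1):
        for j in range(1, nj - 1):
            for k in range(1, nk - 1):
                t_new[i][j][k] = (
                    t_old[i][j][k]
                    + t_old[i + 1][j][k] + t_old[i - 1][j][k]
                    + t_old[i][j + 1][k] + t_old[i][j - 1][k]
                    + t_old[i][j][k + 1] + t_old[i][j][k - 1]
                ) / 7
    return t_new


# A single warm cell in the middle of a 3 x 3 x 3 block.
block = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
block[1][1][1] = 7.0
smoothed = smooth_interior(block)
```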

As a last example, note that a doall-program-segment only treats
values of INALL variables as having initial value upon entry if
the GIVING or USING clause specified those variables. During
execution of a program-segment, the variables may locally be
assigned other values. Only the centered-variables are saved
under the GIVING clause.

      REAL INALL /ATMOS/ T, WNDVEL
      DOALL /CENTRAL(I,J,K)/ USING T
      TOLD = T
      T = (T + T(I+1,*,*) + T(I-1,*,*))/3
      XY(1) = T
      XY(2) = (TOLD + T(*,J+1,*) + T(*,J-1,*))/3
      XY(3) = (TOLD + T(*,*,K+1) + T(*,*,K-1))/3
      T = (XY(1) + XY(2) + XY(3))/3
      ENDDO /CENTRAL/ GIVING T

In this example only T(*,*,*) is saved upon completion of all
instances. The array XY and the variable TOLD are LOCAL
variables. These variables are used only by the active instance. In
order to conserve storage, the same processor memory locations
used for LOCAL variables during execution of an instance in a
processor can be used for the LOCAL variables of another instance
when more than one instance of a doall_program_segment is
evaluated in any given processor. Note that the original value of
T(*,*,*) had to be saved since the second statement changed the
value (as far as the particular instance was concerned). In this
way execution of a doall-segment is the same as that of any
FORTRAN segment with the INALL variables specified in GIVING
clauses initialized as if with a DATA statement upon each entry.

4.2.2.6.3 Unreferenced Variables. In some cases, a variable may be
identified within a separately compiled segment, but never be used
within that segment. This happens, for example, if the main
program has a named common area that is used in a number of
subroutines, and the area must exist in the main program for the
purpose of holding data created by one of the subroutines and used
by the other. In this case, the compiler will not have access to
the USING and GIVING declarations, because of the separate
compilation. Until a better way of handling this situation is
defined, the declaration of these named common areas will be
expanded by prefixing them with an indication of how they will be
used, when they are used.
STRUCTURE declares that the variables and arrays within the
given common block will be used as if they had been included
in INALL statements and USING or GIVING clauses.

GLOBAL declares that the variables and arrays in the given
common block will be used as if they had been included in
USING or GIVING clauses.

All variables and arrays in a given named common must be used in
the same way (i.e., as STRUCTURE or GLOBAL).

4.2.2.7 Storage Allocation

The Flow Model Processor has two major areas of storage to be
concerned with during execution: the main memories of the
processors and the extended memory. The primary use of extended
memory is for the STRUCTURE data (the "old" state values).
Processor memory is allocated to program and to data storage
space. The data storage space is further divided into temporary
areas used only while an instance is active (the LOCAL variables)
and into areas which are allocated to each instance resident in
the processor. The data areas allocated to instances hold the NEW
values as well as copies of OLD values. Note that although many
instances of a process may be assigned to a particular processor,
only the data areas reflect that assignment. Only one copy of the
program-segment would exist.

The GLOBAL variables normally have storage space allocated both in
the processor memories and in the EM. This allocation is a
space-time tradeoff. If only the original copy existed in the EM,
then each instance would have to access it separately with
potential conflicts (when more than one processor tries to access
the same EM location simultaneously, only one is granted access;
any others wait). If the value is broadcast to all processors
simultaneously (say at the start of a doall), then any references
would be to the local copy already resident in each processor.

4.2.2.8 Independent Compilation

Program units, as with any conventional FORTRAN, may be separately
compiled. Note that there may need to be a distinction between
two classes of subroutines. One class would be those called
within a doall_program_segment. These subroutines would be
completely independent of any coordinator code and would have any
embedded DOALL constructs implemented as nested DO loops. The
other class of subroutine would be those called outside a
doall_program_segment. If a subroutine of this class did have an
embedded DOALL construct, both coordinator and processor code would
have to be generated in order to take advantage of the available
processors.

One solution to the above situation is to distinguish one class of
subroutine from the other. This could easily be done with a
simple construct added to the SUBROUTINE statement itself.
For example

      SUBROUTINE BTRI DOMAIN I=1,IM/

This would indicate that BTRI would be called within a
two-dimensional doall and that IM*JM copies of the subroutine
should be available to the instances of the doalls.

Other solutions exist. They include independent compilation for
checking purposes but full compilation to generate code files.
Another solution would be to provide information concerning
location of doall_constructs to the linker and have the linker
include coordinator code where needed. All of the above solutions
are still under consideration to determine the most effective
solution.


4.2.2.9 Code Generation

The compiler will produce code for both the coordinator and all
processors. A very straightforward division of control would
exist. That code required to synchronize DOALLs and to support
interaction with the external environment would be resident on the
coordinator. All other code would be allocated to the processors.
The DOALL constructs just described are simple examples of this.
The processors would each contain a copy of the
doall-program-segment together with some identification of that
segment. When the flow of control of the program arrives at the
DOALL, the coordinator would broadcast a "start segment n" command.
When all instances have completed and all processors have notified
the coordinator with "I got here", the coordinator would
synchronize the updating of OLD values in the EM followed by
initiating the next program-segments in the processors.
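The division of control described above can be mimicked in a small
Python simulation. This is a sketch under several assumptions and not
the FMP design itself: threads stand in for processors, a dictionary
named em stands in for the Extended Memory, instances are assigned to
processors round-robin, and the names coordinator, processor, and
double_plus_left are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def processor(instances, em_old, segment):
    """Run every instance assigned to this processor; return its new values."""
    return {point: segment(em_old, point) for point in instances}


def coordinator(em, segment, n_processors):
    """One DOALL: start all processors, wait for all, then update the EM."""
    points = sorted(em)
    # Static assignment of instances to processors (round-robin, assumed).
    shares = [points[p::n_processors] for p in range(n_processors)]
    em_old = dict(em)  # values on entry to the DOALL
    with ThreadPoolExecutor(max_workers=n_processors) as pool:
        # "start segment n": every processor runs the same segment.
        futures = [pool.submit(processor, share, em_old, segment)
                   for share in shares]
        # "I got here": result() blocks until that processor has finished.
        results = [future.result() for future in futures]
    # Only now are the OLD values in the EM replaced by the new ones.
    for new_values in results:
        em.update(new_values)
    return em


def double_plus_left(em_old, point):
    """Example segment: 2*x(i) + x(i-1), with 0 beyond the left edge."""
    return 2 * em_old[point] + em_old.get(point - 1, 0)


em = {i: i for i in range(8)}
coordinator(em, double_plus_left, n_processors=3)
```

Because every processor reads only em_old, the contents of em after
the call do not depend on how the threads happen to be scheduled.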

Program segments which are not part of DOALL constructs but which
are standard serial FORTRAN could be analyzed by the compiler to
determine any data dependencies. Separate, data independent code
sequences would be defined with the appropriate conditional tests
so that each processor would evaluate one section of the resulting
program segment. The controls in the coordinator would be the
same as for the DOALL case (in effect, a "DOALL" would have been
constructed out of the original code). Since the processors can
all operate autonomously, this approach should result in
additional speed-up on serial codes. A speed-up on the order of 2-5
over straight serial execution is likely from this approach. The
application analysis summarized in Chapter 3 DID NOT assume this
speed-up of sections of serial code. Note that a separate
high-speed "scalar" processor is not required. Each processor is
independently capable of scalar execution, so that concurrency is
not dependent upon vectorization, as it is in today's vector
machines.
4.2.2.10 Functions


Functions on the FMP will include not only the normal mathematical
intrinsics, such as ATAN, LOG, EXP, and SQRT, but also a family of
functions that are brought about because of the parallel nature of
the FMP. The global intrinsics, which reflect the parallel
structure of the system, are described in more detail below.
Table 4.1 lists the functions which could be provided in FMP
FORTRAN. In addition to listing the function, the table also
lists the expected implementation (such as operator, in-line
expansion, or calls on external function subprograms). Some of the
functions will combine in-line code with external calls and are
marked for both.

4.2.2.10.1 Global Functions. The global functions have no analog
in a serial machine and are not normally used in the direct
description of a model. These functions are useful in the
simulation controls and in the summary and analysis of the results
of a simulation.

The global functions operate across the declared parallelism
defined in the model structures. For example, the following
serial FORTRAN


      A = 0.0
      DO 1 J=1,100
      A = A+B(J)
    1 CONTINUE

would be replaced by

      DOALL J=1,100 USING B
      A = SUMALL (B(J))
      ENDDO GIVING A

Note that this is implemented in two levels. First, the values
from all the instances assigned to a given processor are summed to
generate a partial sum. At the end of the DOALL construct, the
coordinator generates log2 P (P = number of processors) operation
sequences to associate pairs of partial sums to get a set of
higher-order partial sums. These sums are then paired and summed
successively until one value remains.
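The two levels can be sketched in Python as follows. The function
name sumall and the round-robin assignment of instances to processors
are assumptions made for illustration. The sketch returns the number
of pairing steps alongside the sum so that the log2 P behavior is
visible.

```python
def sumall(values, n_processors):
    """Two-level sum: per-processor partial sums, then pairwise combining."""
    # Level 1: each processor sums the instances assigned to it.
    partial = [sum(values[p::n_processors]) for p in range(n_processors)]
    # Level 2: pair up partial sums until a single value remains.
    steps = 0
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0)  # pad so every partial sum has a partner
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
        steps += 1
    return partial[0], steps


total, steps = sumall(list(range(1, 101)), n_processors=8)
```

With eight processors the pairing stage takes three steps; with 512
processors it takes nine.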
Table 4.1  Intrinsic Functions (Cont.)

Notes  for  Table: 

Note 1. The value returned by INT is that integer of the same
sign as a with a magnitude not larger than |a|. If a is too
large, integer overflow is reported.

Note 2. The representation of these functions in the FORTRAN
source will use the standard double-asterisk notation for
exponentiation. The function called will depend on the data types
of A and B in A**B. The external function called is an alternate
entry into the EXP function subprogram.

Note 3. In FMP format, FIRST and SNGL are different names for the
same function.

Note 4. The values of i mark the elements of a domain. These,
and the following functions occur across all those instances of a
DOALL that execute the statement containing the function. Thus,
with 2 arguments showing in the code, there will be 2 x imax actual
arguments, where imax is the size of the domain.

Note 5. LOCATION finds the instance number of the instance in
which the previous MAXALL or MINALL found its maximum or minimum.
LOCATION with a logical argument finds the instance number of one
of the instances in which that variable is true.

Note 6. RECURRENCE is discussed below.
The result of a global function is not available for use within
the  DOALL  in  which  it  is  called.  Since  the  various  instances  of  a 
doall-program-segment  may  be  executed  in  arbitrary  order,  any 
given  instance  may  complete  before  some  other  instance  has 
provided  its  input  to  the  global  function.  Thus,  the  output  of 
the  function  is  not  defined  until  the  execution  of  the  last 
instance.  The  results  of  the  global  function  are  available  after 
control  passes  the  ENDDO. 

4.2.2.10.2 LOCATION. The LOCATION function operates with the
assumption that the value returned is the instance number of the
successful instance of the most recent execution of MAXALL,
MINALL, ... The subsequent use of this instance number as a
subscript depends on the implicit equivalence between a
one-dimensional subscript and a unique multiple-subscript.

For example, given a structure variable A declared INALL over a
domain, the largest element of the array could be determined as
follows:

      DOMAIN /LAYER/: I=1,10000
      REAL INALL /LAYER/ A
      DOALL /LAYER/ USING A
      IPTR = LOCATION (GLOBALMAX(A))
      ENDDO /LAYER/ GIVING IPTR
      PRINT A(IPTR)
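The equivalence between a one-dimensional instance number and a
multiple-subscript can be sketched in Python. The text does not say
how instances are numbered, so the sketch assumes the usual FORTRAN
storage order (first subscript varying fastest) and 1-based numbering.
The names to_subscripts, to_instance, and location_of_max are
hypothetical.

```python
def to_subscripts(instance, extents):
    """Convert a 1-based instance number to 1-based subscripts.

    Assumes the first subscript varies fastest (FORTRAN storage order).
    """
    offset = instance - 1
    subscripts = []
    for extent in extents:
        subscripts.append(offset % extent + 1)
        offset //= extent
    return tuple(subscripts)


def to_instance(subscripts, extents):
    """Inverse of to_subscripts."""
    offset, stride = 0, 1
    for subscript, extent in zip(subscripts, extents):
        offset += (subscript - 1) * stride
        stride *= extent
    return offset + 1


def location_of_max(values):
    """1-based position of the largest value (first one if there are ties)."""
    return max(range(len(values)), key=values.__getitem__) + 1
```

For a domain with extents (4, 3, 2), instance 6 corresponds to the
subscripts (2, 2, 1) under this numbering.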

4.2.2.10.3 RECURRENCE. The RECURRENCE function is only defined
over domains active in one dimension. The RECURRENCE function
would be invoked as shown in the example below:

      A(J+1) = RECURRENCE (A(J)*B+C(J))

where A is declared INALL across the DOMAIN.

The prototype compiler should implement only the simple form of
recurrence, with B constant. The additive term need not be
subscripted and may be missing. The constant B may be omitted
when it is equal to 1.

RECURRENCE, the global operation, is the formation of a parallel
linear recurrence in nine (= log2 512) steps as demonstrated by
Shyh-Ching Chen in his doctoral thesis at the U. of Illinois [13].
In FORTRAN, consider

      DO 1 J=1,512
      A(J+1) = A(J)*B + C(J)
    1 CONTINUE
This program segment takes 512 steps, each with one multiply, and
one  add.  The  parallel  algorithm  in  RECURRENCE  produces  the  same 
result  in  nine  steps. 
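One well-known way to obtain such a logarithmic step count is
recursive doubling, sketched below in Python. The text does not
describe the algorithm actually used in RECURRENCE, so this is only an
illustration of how nine steps can suffice for 512 elements; the
function names are hypothetical. Each element carries the linear map
x -> m*x + k, and at every step each element composes its map with
the map held a fixed distance to its left.

```python
def serial_recurrence(a0, b, c):
    """A(j+1) = A(j)*b + c(j), evaluated one step at a time."""
    a = [a0]
    for cj in c:
        a.append(a[-1] * b + cj)
    return a


def parallel_recurrence(a0, b, c):
    """Same result by recursive doubling; also returns the step count."""
    n = len(c)
    # maps[j] = (m, k) represents x -> m*x + k, initially one recurrence step.
    maps = [(b, cj) for cj in c]
    span, steps = 1, 0
    while span < n:
        updated = list(maps)
        for j in range(span, n):  # conceptually all j at once
            m1, k1 = maps[j - span]  # earlier stretch, applied first
            m2, k2 = maps[j]         # later stretch, applied second
            updated[j] = (m2 * m1, m2 * k1 + k2)
        maps = updated
        span *= 2
        steps += 1
    # maps[j] now carries A(1) all the way to A(j+2) in the 1-based text.
    return [a0] + [m * a0 + k for m, k in maps], steps


c_terms = list(range(512))
expected = serial_recurrence(1, 3, c_terms)
actual, step_count = parallel_recurrence(1, 3, c_terms)
```

Integer inputs are used here so the two results can be compared
exactly; with floating-point data the two orderings may differ by
rounding.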

Although  global  sums,  global  products,  and  the  parallel  linear 
recurrence  are  functions  in  the  language,  they  are  not  always  the 
optimum  programming  method  for  producing  these  particular  results. 
For  example,  take  the  serial  FORTRAN  below. 

      DO 1 J=1,1000
      DO 1 K=1,1000
      A(J,K+1) = A(J,K) * B(J) + C(J,K)
    1 CONTINUE

There are several ways to write this in FMP FORTRAN given that the
order of nesting the loops is irrelevant otherwise. Two of them
are:

Method I:

      DOALL J=1,1000
      DO 1 K=1,1000
      A(J,K+1) = A(J,K) * B(J) + C(J,K)
    1 CONTINUE
      ENDDO

Method II:

      DOALL K=1,1000
      DO 1 J=1,1000
      A(J,K+1) = RECURRENCE (A(J,K) * B(J) + C(J,K))
    1 CONTINUE
      ENDDO

Method I, which executes the recurrence serially in an inner loop,
runs over nine times as fast as Method II, which executes each one
of the recurrences in parallel across each value of J in turn.
That is, Method I is 512 times as fast as a single processor,
while Method II is 57 times faster than a single processor. The
global functions are included for those cases where Method I is
not an available option.

4.2.2.10.4 Efficiency of GLOBAL Functions. The global functions
are logarithmic in efficiency for domains up to 512 in size. That
is, it takes nine steps to produce the 512-way result across all
512 processors. For larger domains, the global function is
executed serially with respect to all those instances executed on
each processor (called CYCLES). As a result, the number of steps
required for SUMALL, for example, is N/512 + 8 where the domain
has N elements.
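The step count can be written as a one-line Python function. The
formula N/512 + 8 is taken from the text; rounding N/512 up when N is
not a multiple of 512 is an assumption of this sketch, as is the name
sumall_steps.

```python
import math

PROCESSORS = 512  # processor count used throughout this section


def sumall_steps(n_elements):
    """Steps for SUMALL over n_elements, following N/512 + 8.

    Partial cycles are rounded up to a whole cycle (assumed).
    """
    cycles = math.ceil(n_elements / PROCESSORS)
    return cycles + 8


steps_512 = sumall_steps(512)
steps_10000 = sumall_steps(10000)
```

For N = 512 this gives nine steps, and 512/9 is about 57, which
appears to match the factor quoted for Method II above.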
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4*2*2.1G.5  Direct  Calls  on  Global  Functions*  An  alternative 
construct  for  global  functions  fil 

global-f unct ion-name  /domain-name/  ( argument-list ) 

For  example: 

B = SUMALL/DD(J)/(A(J))

is equivalent to

DOALL /DD(J)/ USING A
B = SUMALL(A(J))

ENDDO /DD/ GIVING B

This form is the equivalent of single-statement DOALLs when the
statement is a global function.

Boolean  global  functions  may  be  used  directly  in  IF  statements 
once  evaluated.  For  example: 

IF (ANY /DD(J)/ (A(J))) ...

is equivalent to

DOALL /DD(J)/ USING A
DUMMY = ANY (A(J))

ENDDO /DD/ GIVING DUMMY
IF (DUMMY) ...

When LOCATION directly follows such an implied single-statement
DOALL, the compiler combines it into the DOALL of the previous
global function.

For  example 

MM  = MAXALL/DD/(A(J)) 

IX  = LOCATION 

is  equivalent  to 

DOALL  /DD/  USING  A 
MM = MAXALL (A(J))

ENDDO  /DD/  GIVING  MM 
IX  = LOCATION 
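
A minimal Python sketch of the combined operation follows. It assumes
that LOCATION reports the position of the first maximal element and
uses 0-based positions; neither detail is stated in the text, and the
function name is hypothetical.

```python
def maxall_with_location(values):
    """Return (maximum, position) in a single pass over the domain."""
    best_index = 0
    for index, value in enumerate(values):
        if value > values[best_index]:
            best_index = index
    return values[best_index], best_index
```

Producing both results in one pass mirrors the compiler folding
LOCATION into the DOALL of the preceding global function.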


4.2.2.11 Assignment Statements


The following pertains to execution within each instance of a
doall-program-segment. Recall from section 4.2.2.6.1 (and Figure
4.11) that three classes of variables exist in doall-program-
segments. They are called GLOBAL, STRUCTURE, and LOCAL.

All STRUCTURE variables have their old values when any instance
begins execution. Assignment to any structure variable from
within an instance will result in the new value being available
only within that instance. Other instances would still refer to
the old value unless they too had executed a similar assignment
statement. Once all instances are complete, the STRUCTURE
variables are all updated with the new values computed within the
instances.
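
The STRUCTURE variable rule can be sketched in Python as a snapshot
plus deferred commit. This is an illustration only: the class and
function names are hypothetical, and the behavior when two instances
assign the same element (here the later instance in iteration order
wins) is an assumption the text does not address.

```python
class InstanceView:
    """What one instance sees: old values plus its own assignments."""

    def __init__(self, old):
        self._old = old
        self.writes = {}

    def __getitem__(self, key):
        return self.writes.get(key, self._old[key])

    def __setitem__(self, key, value):
        self.writes[key] = value


def doall(domain, structure, body):
    """Run body(j, view) for each j; commit all writes after every instance."""
    old = dict(structure)          # values in force when instances begin
    pending = {}
    for j in domain:               # instances are independent of one another
        view = InstanceView(old)
        body(j, view)
        pending.update(view.writes)
    result = dict(old)
    result.update(pending)         # update happens once all instances finish
    return result
```

For example, with A = (1, 2, 3, 4) and the body A(J) = A(J) + A(J+1)
taken cyclically, the last element becomes 4 + 1 = 5 because it reads
the old A(1), whereas an ordinary serial loop would read the already
updated value and produce 7.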

Assignment  to  a GLOBAL  variable  or  array  element  will  redefine  the 
value  of  that  variable  within  the  instance  in  which  the  assignment 
is  made.  However,  the  original  value  of  the  variable  remains 
available  for  reference  by  any  other  instance.  Since  a GLOBAL 
variable  must  have  only  one  value,  a doall-program-segment  may 
assign  new  values  to  GLOBAL  variables  only  through  a GLOBAL 
function which maps a set of STRUCTURE variable values onto a
single  value.  Such  a new  value  is  available  only  after  the  ENDDO 
statement.  All  other  apparent  assignments  to  a GLOBAL  variable 
within  the  DOALL  define  the  GLOBAL  variable  to  the  end  of  the 
instance. 

Assignment to LOCAL variables may take place at any point during
execution of an instance. Operation is as with standard FORTRAN
except that, upon completion of the instance, the storage space
allocated to such LOCAL variables would be reassigned upon exit
from the doall-program-segment. A compiler option will exist such
that LOCAL variables would be assigned unique storage locations
for each instance. In this case, LOCAL variable values would
carry over from one reference to another, even between different
DOALL constructs.

External to a DOALL, all references and assignments to GLOBAL and
STRUCTURE  variables  are  valid.  In  such  a case  STRUCTURE  variables 
must  be  subscripted. 

4.2.2.12  Miscellaneous  Features 

4.2.2.12.1  Same-line  Comments.  A reserved  character,  not  in  the 
FORTRAN  character  set,  will  be  defined  that  may  be  used  to 
terminate  a statement.  Thus,  anything  following  on  the  same 
physical  card  is  comment.  A likely  character  is  "%". 

When  the  syntax  of  a statement  is  such  that  the  only  allowable 
characters  on  the  rest  of  the  card  are  blanks,  the  compiler  will 
not check. Thus, statements like ENDIF, IF (-boolean-) THEN allow
comments  to  be  placed  on  the  rest  of  the  card. 

4.2.2.12.2 Recursion. Recursive calls are allowed. Note that
although the second, nested call on the subroutine gets a second
set of subroutine-local variables and arrays (separate from the
set belonging to the outer call), any named common that belongs to
the subroutine will be the same named common area in both calls.

4.2.2.12.3 DO LOOPS. Since a domain consists of a finite ordered
set, the control of a DO loop can be specified with a set of
domain elements. For example:

DOMAIN /LAT/: I=1,IM
DO 1 /LAT/

is equivalent to

DO 1 I=1,IM

If the domain is multidimensional, the order of nesting of
"implied" DO loops is FORTRAN subscript order. That is, the first
variable is the index of the inner loop. The last variable is the
index of the outer loop.
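
The nesting rule can be illustrated with a short Python sketch that
lists the order in which a multidimensional domain would be visited.
The function name and the representation of a domain as a list of
(name, values) pairs are assumptions made for this example.

```python
from itertools import product


def implied_do_order(domain):
    """List index tuples in FORTRAN subscript order.

    domain is a list of (name, values) pairs in declaration order; the
    first variable varies fastest (inner loop), the last slowest.
    """
    ranges = [list(values) for _, values in domain]
    # product() varies its last argument fastest, so feed the ranges in
    # reverse and flip each tuple back to declaration order.
    return [tuple(reversed(combo)) for combo in product(*reversed(ranges))]
```

For a domain over I = 1..2 and J = 1..3 this visits (1,1), (2,1),
(1,2), (2,2), (1,3), (2,3), that is, I is the inner loop index.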

4.2.2.12.4 EXIT Statement. The EXIT statement can be used to
terminate an individual instance of a DOALL construct. In
addition, a DO loop may be terminated with an EXIT statement. EXIT
statements are permitted wherever executable statements are
allowed.


4.2.2.12.5 Dynamic Array Sizes. Space is not allocated for a
named common until the first program unit using that named common
is entered. Likewise, space is not allocated for variables and
arrays of a program unit until that program unit is entered.
Hence, sizes of common blocks and dimensions of arrays may be set
dynamically during program execution. The only requirement
is that the expression determining the size be evaluated at
the point in the program where the declaration occurs. In the
case of arrays in named common areas, the size-determining
expressions must evaluate to the same value in every program unit,
or a run-time error is likely.
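
The allocation rule for named common can be sketched in Python as
follows. The class and method names are hypothetical, and raising an
explicit error on a size mismatch is an assumption: the text says only
that a run-time error is likely, not that the system checks for it.

```python
class CommonAreas:
    """Tracks named common areas allocated on first use."""

    def __init__(self):
        self._sizes = {}

    def enter_unit(self, unit, declared):
        """Record the common areas a program unit declares on entry.

        declared maps a common-area name to the size obtained by
        evaluating the unit's size expression at the point of entry.
        """
        for name, size in declared.items():
            if name not in self._sizes:
                self._sizes[name] = size       # first user allocates
            elif self._sizes[name] != size:
                raise RuntimeError(
                    f"{unit}: common /{name}/ declared with size {size}, "
                    f"already allocated with size {self._sizes[name]}"
                )
        return dict(self._sizes)
```

A unit entered later that evaluates the same size expression to the
same value simply attaches to the existing area.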

4.2.2.13 Input Output

All  FMP  input  and  output  is  staged  via  the  Data  Base  Memory. 
Since  I/O  is  inherently  serial,  a mapping  of  concurrent  execution 
to  the  serial  form  supported  by  peripherals  is  required.  That  I/O 
specified  within  the  serial  parts  of  FMP  FORTRAN  programs  occurs 
as  specified.  That  I/O  specified  within  a DOALL  over  a DOMAIN  or 
a REGION  is  processed  as  requested  over  time.  Since  the  instances 
of  a DOALL  are  independent,  no  attempt  to  order  I/O  of  one 
instance  with  respect  to  another  is  made  although  the  time 
sequence  of  I/O  within  any  one  instance  is  maintained. 

Formatted I/O is expected to be supported primarily by the Support
Processor since the major formatting load is on output. In
addition, the applications studied were such that input formatting
could be accomplished prior to initiation of an FMP task. As a
result all FMP I/O would be direct I/O via the DBM. These
assumptions have affected the instruction set choices in the
processors of the FMP. No powerful character handling instructions
exist at this time. Due to the heavy loading of output formatting to
support the COM load (in excess of 10,000 frames of graphic
information per day), continued consideration is being given to
moving formatting support onto the FMP. The system as evaluated
(with Support Processor formatting functions) could certainly
support the expected workload. The remaining questions pertain to
whether a more cost-effective solution might exist.

4.2.3 FMP FORTRAN Compiler

As  with  any  large  design  problem,  a compiler  development  project 
involves  a number  of  stages  including  some  means  of  testing  design 
ideas.  The  compiler  discussed  below  is  actually  envisioned  to  be 
a succession  of  compilers  beginning  with  what  would  best  be 
described as a "prototype FMP FORTRAN compiler". Where
appropriate, these discussions will point out features or capabilities
planned  for  the  prototype  compiler  or  planned  to  be  deferred  to 
later  versions. 

The  FMP  FORTRAN  compiler  would  execute  on  the  Support  Processing 
System.  Source  input,  generated  code  and  other  output  would 
reside in the NASF File System.

4.2.3.1 Functional Objectives

The  functional  objectives  of  the  compiler  are: 

4.2.3.1.1 Support to the User. Not only should compile-time
messages be clear, but run-time aids should be provided for
debugging, for gathering statistics and for monitoring the dynamic
execution of a program. Other facilities should include
generation of optional memory, array and extent bounds checks.

4.2.3.1.2 Support of the Language. The defined language (FMP
FORTRAN)  would  be  the  language  supported  by  the  compiler.  No 
changes  to  the  language  or  compiler  would  be  made  without 
consideration  of  the  other. 

4.2.3.1.3 Make Efficient Use of FMP Resources. The compiler may
never be capable of implementing the "most efficient" use of FMP
resources. This inability is due, in part, to the data-dependencies
which are run-time sensitive and, in part, to the
complexities of global optimization.

The prototype FMP FORTRAN compiler is expected to implement
limited optimization. The level of optimization at the prototype
stage would be "peephole" optimization giving improved overlap of
the FMP functional elements during execution. For example,
register allocation could be adjusted so that the store to memory
ending one statement would follow the first fetch or two belonging
to the succeeding statement. Requests for data from EM (LOADEM's)
would be positioned near the start of a program-segment. This
position should make it possible for CN delays to occur concurrent
with processing. Where possible, integer and floating point
instructions would be rearranged to improve overlap. Optimization
of this sort requires local, straightforward data flow
analysis probably using the register addresses as data
identifiers.

Since static allocation of the defined processes onto the memory
and processor resources is planned, the resources might not be
used as efficiently as in a dynamic, "load-leveling" run-time
allocation scheme. Unfortunately no efficient, yet simple,
dynamic scheme has been studied as yet. As experience is gained,
static optimization will occur in two major areas: data or
storage allocation and processor allocation. For example, as
data-dependency analysis improves, code can be generated which
maintains STRUCTURE variables always within a processor if all
instances which refer to those variables are also within the same
processor. Data-dependency analysis would also likely be used to
assign instances of DOALLs to processors on the basis of least
communication with Extended Memory.

Another  means  toward  meeting  the  goal  of  efficient  use  of  FMP 
resources  is  to  generate  efficient  object  code.  Some  of  this 
efficiency  will  be  derived  from  classical  compilation  techniques 
(feasible  since  most  of  the  task  involves  generating  code  for 
individual  processors).  Some  of  the  efficiency  will  come  because 
of  the  simplicity  of  having  only  one  program  in  execution  at  a 
time. 


4.2.3.1.4 Support the Operational Environment. The FMP Compiler
would be able to provide the necessary linkages to the logical
input-output subsystem. In addition the compiler would produce
the necessary information for the linkage editor.

Since the proposed FMP organization is very modular and is likely
to be implemented first with a limited number of processors, an
option which must be available with the prototype compiler is to
compile for "N" processors and "M" memories. This capability
should add considerably to the time available for debug and system
integration of the software since not all 512 processors need be
available to begin system testing.


4.2.3.2 Functional Organization

Figure 4.12 shows the expected functional organization of the FMP
Compiler.  The  internal  interface  between  all  components  shown 
would  be  a common  representation  of  the  compiled  program.  Such  a 
common  representation  should  allow  the  development  of  compiler 
design  and  debugging  aids.  For  example,  the  source  generator 
module  could  be  used  at  any  phase  of  compiler  execution  to 
generate  a record  of  the  current  state  of  compilation. 

4.2.3.3 Domains

The prototype compiler would handle only rectangular domains. In
addition, the domains would be constrained to a maximum of four
domain variables with constant spacing. These restrictions are
suggested to reduce the prototype compiler complexity. The
hardware proposed would tolerate any kind of index set as a domain.
Language features have yet to be proposed for describing such
non-rectangular domains.

4.2.3.4 Data Flow Analysis

Data-flow analysis is not required to produce executable code so
the prototype compiler is not expected to have such an analysis
capability. However, the compiler can do a much better job of
optimizing  when  a data-flow  analysis  is  included.  One  of  the 
chief  uses  of  data  flow  analysis  would  be  to  improve  memory 
allocation  decisions.  For  example,  if  more  structure  variables 
can  be  held  in  processor  memory,  the  number  of  EM  fetches  and 
stores  would  be  reduced  with  a likely  improvement  in  throughput. 

4.2.3.5 Memory Allocation

Memory  allocation  is  static  in  the  sense  that  only  one  program 
occupies  the  FMP  at  any  given  time,  and  that  the  same  variable  in 
that  program  always  occupies  the  same  memory  address  if  the  same 
run  is  repeated.  Allocation  is  dynamic  in  the  sense  that  space  is 
allocated  to  named  common  areas  only  when  the  first  program  unit 
using  them  is  entered;  space  is  allocated  to  variables  local  to  a 
program unit only when that program unit is entered; and these
spaces  are  deallocated  when  the  last  program  unit  using  these 
local  variables  is  exited.  Hence,  the  same  physical  memory  area 
may  successively  be  allocated  to  local  variables  in  a number  of 
program  areas.  As  mentioned  earlier,  an  option  would  be  available 
such  that  no  deallocation  of  unused  memory  space  would  occur. 
This  option  would  be  useful  if  data  values  are  carried  from  one 
call  of  a subroutine  to  the  next. 

Program  and  data  areas  have  no  relationship  to  each  other.  They 
would  be  separately  managed.  In  fact,  separate  calls  on  the  same 
subroutine  from  different  places  may  have  the  local  working  space 
of  the  subroutine  allocated  to  different  places  in  memory,  and  the 
code  file  for  that  subroutine  will  not  have  moved. 

Figure 4.12 Functional Organization of FMP FORTRAN Compiler

4.2.3.6 Subroutine Entry and Return


The subroutine entry and return mechanism would be essentially
that of standard Burroughs machines. This mechanism allows the
deallocation of unused memory space rather than requiring the
space of all subprograms to occupy physical memory addresses even
during the time it is not being used. One of the integer
registers would be used as a stack pointer. It points to a
"return control word" (RCW) which contains: a) the memory address
of the RCW of the procedure calling this one, b) the program
counter setting to which return should be made, and c) the size of
the memory area required by this program. Upon subroutine entry,
the size field, plus the number of parameters to be passed, is
added to the stack pointer, a new RCW is built, and written into
memory at the new stack pointer. Upon subroutine return, the
stack pointer and program counter are loaded from the RCW. The
parameters that are passed include the base addresses of any
shared named common areas, and pointers to any variables or arrays
that are passed by name. (In FORTRAN, all explicit parameters are
passed by name. However, there is some implicit passing of
parameters by value, as when calling a mathematical function.)
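
The entry and return sequence can be sketched in Python as follows.
This is a simplified model, not the Burroughs mechanism itself: the
class names, the use of a dictionary for memory, the convention that
the return address is the word after the call, and the omission of the
parameter words themselves are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RCW:
    caller_rcw: int   # a) memory address of the caller's RCW
    return_pc: int    # b) program counter setting for the return
    size: int         # c) size of the memory area this procedure needs


class Machine:
    def __init__(self, main_size):
        self.memory = {0: RCW(caller_rcw=-1, return_pc=-1, size=main_size)}
        self.sp = 0   # stack pointer: address of the current RCW
        self.pc = 0

    def call(self, target_pc, n_params, callee_size):
        current = self.memory[self.sp]
        # Size field plus number of parameters is added to the stack pointer.
        new_sp = self.sp + current.size + n_params
        self.memory[new_sp] = RCW(self.sp, self.pc + 1, callee_size)
        self.sp, self.pc = new_sp, target_pc

    def ret(self):
        rcw = self.memory.pop(self.sp)
        # Stack pointer and program counter are reloaded from the RCW.
        self.sp, self.pc = rcw.caller_rcw, rcw.return_pc
```

Because each call builds a fresh RCW above the caller's working space,
a procedure can call itself and each activation keeps its own area.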

The  result  of  managing  subroutine  working  space  as  a stack  is  that 
recursive  subroutine  calls  are  allowed,  even  though  there  seems  to 
be  no  use  for  them  in  aero  flow  and  weather  codes. 

4.2.3.7 Concurrency

In  the  prototype  compiler,  the  only  concurrency  allowed  will  be 
that  of  all  the  instances  of  a single  DOALL.  All  instances  would 
be  executing  copies  of  the  same  code  file.  Execution  sequencing 
dependent  on  which  domain  element  an  instance  is  associated  with 
could  be  controlled  by  testing  the  instance-variables  to  determine 
which  element  they  represent.  Nested  DOALLs  would  have  the  inner 
DOALL implemented as an ordinary DO loop.

The  hardware  is  not  constrained  to  have  all  processors  executing 
out  of  the  same  code  file.  Thus,  in  principle  some  instances  of  a 
DOALL  could  have  one  sequence  of  code,  and  other  instances  could 
have some other sequence, but this would not be allowed in the
prototype  compiler. 

Capabilities for operations in which the processors operate
asynchronously with no synchronization are not provided. Neither are
capabilities  provided  in  which  a few  processors  are  allowed  to 
execute  code  separately  from  the  other  processors  which  are  using 
the  coordinator  for  synchronization. 

4.2.3.8 Duplexed Computation Mode

A compiler option planned (but not for the prototype compiler) is
to generate the code and controls to execute each sequence of code
twice but with the spare processor switched between executions.
In this mode, all execution occurs in a different set of processors
on the second pass. The results of the two passes would then
be compared as a confidence test for highly-reliable results.
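
The idea of duplexed computation can be sketched in Python. The
mapping used here, in which the second pass shifts every instance to
the next processor so that the spare comes into use, and the modeling
of a fault as a result that is off by one, are assumptions chosen for
illustration; the text does not specify either.

```python
def duplexed(work, instances, n_procs, faulty=frozenset()):
    """Run every instance twice on different processors and compare.

    Processors are numbered 0..n_procs, where n_procs is the spare.
    Returns the first-pass results and the indices that disagree.
    """

    def one_pass(shift):
        results = []
        for i, x in enumerate(instances):
            proc = (i % n_procs) + shift
            value = work(x)
            if proc in faulty:
                value += 1          # model a fault as a corrupted result
            results.append(value)
        return results

    first, second = one_pass(0), one_pass(1)
    mismatches = [i for i, (a, b) in enumerate(zip(first, second)) if a != b]
    return first, mismatches
```

With no faulty processor the two passes agree everywhere; a single
faulty processor shows up as mismatches on the instances it handled in
either pass.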

4.3 OPERATING SYSTEM

The NASF should have only one operating system, pieces of which
execute on the various portions of the system. In the discussions
below, this operating system is called the Master Control Program
(MCP). The purpose of the MCP is to provide software support for
the following:

1.  Scheduling and controlling the flow of programs and files
to and from various processors in the system (including
the Support Processing System and the FMP),

2.  Initiating staging of jobs onto the FMP,

3.  Memory management including storage management and data
management,

4.  Support of the FMP FORTRAN programs for functions that
cannot be performed in problem mode because of overall
system implications,

5.  Support of other functions of the Support Processor-FMP
interface such as performance monitoring, error logging
and operator control,

6.  Support of the external environment including interrupt
handling, I/O handling, peripheral control and data
communications,

7.  Providing certain system utilities such as dump, and
system log analyzer,

8.  Support of diagnostics and maintenance for all parts of
the system.

The development of a system of this magnitude is a major task.
During the study of the feasibility of the NASF, the MCP
considered was based on the existing MCP on Burroughs 700 series and
800 series systems, in particular the B7800. The MCP of this
system has evolved from systems as early as 1960 and is,
therefore, a mature system which would need no modification to
satisfy many of the above requirements. Recently, Burroughs has
been developing the Burroughs Scientific Processor (BSP) as an
attached processor to the B7800. The general philosophies of job
flow and task management in the NASF and BSP are very similar.
The MCP described below is therefore based on some of the design
decisions and experience gained in the BSP project.

4.3.1 Assumptions

The evaluation of the proposed MCP implementation is based in part
on the assumption that the FMP would be designed to operate most
efficiently on tasks with the following characteristics.

1.  Data  areas  up  to  the  size  of  the  extended  memory  (34 
million  words). 

2.  Long running programs: a minimum runtime of at least one
second, a typical runtime of several minutes to several
hours.

3.  Batch  job  oriented:  user  interaction  is  not  required. 

Also, as discussed in Chapter 2, a self-managed file system
supports the basic data management functions. This file system is
assumed not only to provide the necessary data storage and
retrieval functions, but also to maintain and enforce data ownership
and access control.

4.3.1.1 Computational Envelope

An FMP task, once started, is assumed to run to completion within
the high-performance computational and I/O environment of the FMP
without requiring intervention of or access to the support
processor or any of its I/O devices. The computational envelope is the
high-performance environment. In particular:

1.  All FMP program and data files are assumed to be fully
contained within DBM while the program is in operation.
All files holding the necessary input are assumed to be
within the Data Base Memory (DBM) before the task is
started.

2.  Each FMP program is self-contained as far as resources
are concerned. No dependencies on Support Processor
actions shall occur during the runtime of the program.
Therefore, no operator or user interaction would be
permitted during execution of an FMP program. Operators
and users would be able to query the MCP regarding the
status of the job running on the FMP and would have
normal controls such as cancel or suspend execution.

4.3.2 B7800 MCP

The existing B7800 MCP actually provides more functions than
required for the Support Processor System of the NASF. Only those
sections which are of major importance to the NASF MCP are
summarized below.

4.3.2.1 Interrupt Handling

The B7800 style systems being considered for the Support Processor
are all interrupt-driven. The interrupt handling section
interfaces with all the resource-handling parts of the MCP. Interrupts
are caused by the B7800 CPU, by the I/O Processor, and by software.
Some of the interrupts processed by the interrupt handler are:

1.  Caused by B7800 CPU

a.  Interval Timer

b.  Presence Bit not set (part of automatic memory
management)

c.  Invalid Operand

d.  Invalid Index

e.  Processor-to-Processor Communications

2.  Caused by B7800 I/O Processor

a.  Operator Request Pending

b.  I/O Complete

c.  Data Comm. Processor Ready-To-Send

3.  Caused by Software

a.  Inter-Task or Intra-Job Communication

4.3.2.2 Memory Management

Memory management methods supported by the B7800 MCP are designed
for implementation of the "virtual memory" concept within the
B7800. Several methods of memory allocation are supported on the
B7800. These methods include [11]:

1.  On demand

2.  Working set

3.  SWAPPER

All methods use disk as the backup storage device.

4.3.2.3 MCP I/O Handling

Since the MCP is involved in all I/O to and from devices attached
to the B7800, the MCP I/O handling functions are re-entrant code
shared by all tasks running in the B7800 system. These I/O
procedures perform the following functions:

1.  Build the control words necessary to do a physical I/O
operation

2.  Send I/O instructions

3.  Wait for an I/O operation to complete

4.  Notify the associated program to continue

5.  Handle physical I/O errors

a.  Retry where possible

b.  Enter user error routine if declared, or

c.  Discontinue the program
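
The error-handling order in item 5 can be sketched in Python. The
function and exception names are hypothetical, the retry count is an
arbitrary choice, and treating every OSError as a physical I/O error is
an assumption made to keep the example small.

```python
class ProgramDiscontinued(Exception):
    """Raised when an I/O error cannot be recovered or handed to the user."""


def handle_io(operation, retries=2, user_error_routine=None):
    """Attempt an I/O operation: retry, then user routine, then discontinue."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return operation()                 # a. retry where possible
        except OSError as err:
            last_error = err
    if user_error_routine is not None:
        return user_error_routine(last_error)  # b. user error routine
    raise ProgramDiscontinued(str(last_error)) # c. discontinue the program
```

The three outcomes are tried strictly in the order listed, so a
declared user error routine is entered only after retries are
exhausted.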

4.3.2.4 Process Control

The job selection process within the B7800 MCP considers the
priority declared by the user, the time the process has been
waiting, and the "class" (or system-level priority) of the task. The
process control section supports the following functions in the
B7800.

1.  Initiation of tasks required by the user or by the MCP

2.  Task scheduling

3.  Perform "EOT/EOJ" duties such as deallocation and
bookkeeping at End of Task or Job

4.  Make administrative log entries

4.3.2.5 Peripheral Control

Peripheral  Control  procedures  of  the  MCP  are  responsible  for  all 
peripheral  devices  on  the  B7800,  except  disk.  These  procedures 
perform  the  following  functions: 


1.  Locate  input  data  files 

2.  Assign  output  devices  based  on  availability 

3.  Maintain  and  update  table  of  all  available  units 

4.  Handle  I/O  parity  recovery  such  as  tape  parity  and  card 
reader  errors 

5.  Maintain system-level status such as ready, repair,...
for  all  physical  units  including  processors,  memories  and 
peripheral  devices. 

4.3.2.6 Work Flow Management

The processing of the tasks within a user's job is specified
through use of an easy-to-use, high-level work flow control
language called WFL [12]. The work flow management software on
the B7800 consists of a controller (which handles most keyboard
input messages and places control records into a Job Description
File), a WFLCOMPILER (which generates object code for presentation
to the Process Control Section based on jobs in the Job
Description File) and a job formatter (which selectively prints
summary information about the job on the Job Summary sheets).
Most operator keyboard messages are handled through the controller
portion mentioned above.

4.3.2.7 Data Communications

The data communications section of the B7800 MCP is called the
Data Communications Controller (DCC). The functions of the DCC
include:

1.  Allocation and deallocation of Data Communications Queues
which are the interface mechanism between object programs,
system routines such as the editors, and the DCC.

3.  Dynamic reconfiguration of the Data Communications
Subsystem

4.  Generation and maintenance of tables used by the Data
Communications Processors

A system called NDL (Network Definition Language) [14] provides a
user-oriented means of specifying network and terminal
characteristics as well as what processing must be performed
during I/O to match the terminal or network characteristics to the
standard forms processed in the system.

4.3.3 Integration of FMP Task Management into MCP

FMP programs would exist as tasks within the standard WFL (Work
Flow Language) job structure of the B7800. The B7800 portion of
the MCP schedules the FMP task to be staged into the FMP. Once
such a task is initiated, it would run wholly within the
computational envelope without any further B7800 dependence until
the task terminates. The B7800 portion of the MCP may,
optionally, query the status of FMP tasks, or override the FMP
task-selection decisions.

4.3.3.1 Limitations

Some functions traditionally associated with operating systems are
not provided on the FMP even though they are a normal part of the
B7800 itself. Specifically:

1.  FMP FORTRAN is the only language provided.

2.  Interactive programs are not supported.

3.  No  provision,  other  than  direct  I/O,  will  be  made  for 
programs  whose  total  file  sizes  exceed  memory  capacity. 

4.  Delays  due  to  waiting  for  operator  intervention  on  behalf 
of  executing  FMP  programs  would  be  eliminated. 

The data base sizes expected are very large. If a workload
consisting of a large number of very short jobs with large data
bases is encountered, the file system and paths to and from the
DBM would become a bottleneck. If this occurs, efficient
utilization of the FMP would become difficult.

4.3.3.2 Interrupt Handling

The interrupt handling section of the existing MCP would be
modified to include those interrupts caused by the FMP. The major
interrupts from the FMP would be "Task Complete" and "Error State
Pending". Task Complete would be a normal FMP task completion
report. This response would be passed on to the Work Flow
Management section to determine what task to process next. The Error
State Pending would be the report of an abnormal termination.
Status information would have to be scanned out of the FMP to
determine whether the problem is user-related (such as overflow)
or hardware related (such as a failure in that portion of the
system which is not automatically corrected).
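
The handling of the two FMP interrupts can be sketched in Python as a
small dispatch routine. The string names for the interrupts, the
status codes, and the returned action descriptions are all hypothetical;
only the two interrupt types and the user-versus-hardware distinction
come from the text.

```python
# Hypothetical status codes treated as user-related; "overflow" is the
# only example the text gives.
USER_RELATED = {"overflow"}


def handle_fmp_interrupt(kind, status=None):
    """Decide what the B7800 portion of the MCP does with an FMP interrupt."""
    if kind == "TASK_COMPLETE":
        return "pass to work flow management to select next task"
    if kind == "ERROR_STATE_PENDING":
        # Status would be scanned out of the FMP before classification.
        if status in USER_RELATED:
            return "report user-related abnormal termination"
        return "report hardware-related abnormal termination"
    raise ValueError(f"unknown FMP interrupt: {kind}")
```

Anything not recognized as user-related is classed as hardware-related
in this sketch, which is a conservative simplification rather than a
statement of how the MCP would decide.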

4.3.3.3 Memory Management

No change would be made to the B7800 Memory Management section of
the MCP. However, input data and program staging would have to be
initiated by the B7800 MCP for FMP-destined jobs. Only the
requests need to be made. The File System actually performs the
function.

4.3.3.4 Process Control

The  process  control  section  of  the  B7800  MCP  would  be  extended  to 
support  the  scheduling  and  initiation  of  tasks  on  the  FMP.  The 
process  control  section  would  also  maintain  FMP  log  entries  and 
statistics  with  respect  to  workload,  job  lengths,  etc. 

4.3.3.5 Work Flow Management

Extensions in the B7800 WFL (Work Flow Language) would provide the
following functions:

1.  Invoke the FMP FORTRAN compiler and linker.

2.  Specify FMP resource requirements for scheduling and
allocation purposes (such as the amount of DBM buffer
area required during FMP task execution).

3.  Specify job restart point following failure of any
portion of the system.

In  addition,  the  existing  work  flow  management  functions  which 
support  operator  control  of  jobs  and  tasks  in  the  system  would  be 
extended to include tasks running on the FMP. These extensions
would include static controls to give visibility of the status of
a task either queued or active on the FMP. In addition, the
extensions would provide means for an operator to alter the
priorities of tasks queued for service and even to force a roll-out
of an active task (for later resumption). Such a roll-out would
normally be only to the Data Base Memory.

4.3.3.6 Utilities

Various  utilities  specifically  oriented  to  the  support  of  FMP 
operations  would  be  developed.  These  utilities  would  include 
various "analyzer" utilities to edit and format dumps.

4.3.4  FMP  Portion  of  MCP 

A portion of the NASF MCP would be resident in the FMP. In
particular, the coordinator is the part of the FMP which would
execute the FMP portion of the MCP. The functions provided would
include:

1. Interface to the Support Processor for FMP initialization,
operator control, task forwarding, checkpoint/restart,
dumps, etc.

2.  Schedule  and  initiate  tasks  on  the  FMP  from  among  those 
forwarded  from  the  Support  Processor.  Provide  wrap-up 
for  normal  and  abnormal  termination. 


3.  Establish  connection  between  an  active  program  executing 
on  the  FMP  and  the  appropriate  files  in  the  Data  Base 
Memory. 

4.  Service  FMP  interrupts  such  as  invalid  operand  or  errors. 

5.  Provide  the  appropriate  run-time  environment  for  FMP 
FORTRAN  execution.  This  environment  would  include  the 
appropriate  intrinsics  plus  mechanizations  of  time,  date, 
PAUSE  and  dump.  The  run-time  environment  would  also 
support code overlay mechanisms, space allocation, and job
roll-in  and  roll-out. 

4.3.5  File  Management 

An  independent  file  manager  provides  transparent  management  of  all 
files  on  archive,  disk,  and  in  the  Data  Base  Memory  (DBM).  This 
file  manager  is  accessible  from  the  FMP,  the  Support  Processor, 
and  the  Users.  Thus,  the  file  management  system  will  have  capabil- 
ities exceeding  those  required  only  to  support  FMP  execution. 

One  of  the  functions  of  the  file  management  system  will  be  to 
accept  commands  designating  movement  of  or  copying  of  files  from 
one  place  to  another.  Those  commands  would  be  utilized  to  init- 
iate the  movement  of  programs  and  input  data  to  the  Data  Base 
Memory  as  needed  for  FMP  execution. 

The Data Base Memory, and its controller, are considered part of
the File System portion of the NASF although the sole purpose of
the DBM is as a staging memory of FMP jobs and data. Since the
DBM is part of the File System and since the File System provides
data and storage allocation capabilities, the portion of the MCP
on the FMP does not require any file management capabilities.

Another  of  the  functions  of  the  DBM  will  be  to  allow  certain 
functions  to  be  externally  enabled.  The  best  example  of  this 
capability  would  be  a request  to  the  File  System  by  the  Work  Flow 
Management  portion  of  the  MCP  (executing  on  the  Support  Processor) 
to  cause  the  movement  of  result  files  of  a particular  FMP  job  back 
to  the  active  files  from  the  DBM.  This  request  could  be  made 
contingent  on  a message  from  the  FMP  portion  of  the  MCP  to  the  DBM 
controller that the result files are closed and can be released.

Other  functions  to  be  provided  by  the  file  management  system  will 
include: 

1. Dynamic allocation and deallocation of space as required.

2. Establishment and maintenance of directories or other
techniques to map external requests (which will be in
terms of the "name" of a file) to the appropriate physical
storage area.

3. Backup and archiving of files based on specified
conditions or time intervals.

4. File Security functions which would allow user control
over which programs and/or users would be allowed to read
and/or update their files.
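
The directory function in item 2 can be pictured as a table keyed by file name. The following is a minimal sketch under stated assumptions: the class `FileDirectory`, its method names, and the (device, address, length) record layout are hypothetical and are not taken from the file manager design.

```python
class FileDirectory:
    """Maps a file name to an assumed (device, address, length) record."""

    def __init__(self):
        self._entries = {}

    def allocate(self, name, device, address, length):
        # Dynamic allocation: register where the named file now lives.
        if name in self._entries:
            raise KeyError(f"{name} already exists")
        self._entries[name] = (device, address, length)

    def locate(self, name):
        # External requests carry only the name; the directory
        # supplies the physical storage area.
        return self._entries[name]

    def deallocate(self, name):
        del self._entries[name]
```

A caller would issue `locate("RESULTS")` and never handle the physical address directly, which is the property the text relies on.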


4.3.5.1 FMP Interaction with File Subsystem

Since the file system is self-managed, all references to data
within the file system would be by name of the data rather than by
direct reference to its physical position. FMP interaction with
the file system occurs at two levels of the system. First, the
coordinator provides the high-level interface to the file system,
in particular to the Data Base Memory Controller. Second, the
Data Base Memory is part of the File System, and as such has an
operational interface to the File System Manager and the rest of
the file system.

The operational interface between the DBM and the rest of the File
System provides the required data paths as well as control paths
to  support: 

1.  movement  of  files  within  the  file  system 

2.  storage  allocation 

3.  security  functions 

The  interface  between  the  coordinator  and  the  DBM  has  basically 
the  same  functions  as  interfaces  between  the  file  system  and  other 
NASF subsystems such as the Support Processor and the Users.
Allocation  of  space  within  the  Data  Base  Memory  is  controlled  by 
the  File  System,  not  by  application  programs.  The  DBM  maintains  a 
table  to  convert  file  names  into  DBM  addresses.  Thus,  the  files 
referenced  by  the  coordinator  are  referenced  by  name  rather  than 
by  physical  location. 

Control  of  the  files  within  the  DBM  follows  the  philosophy  of  the 
rest  of  the  file  system.  Once  a particular  file  has  been  opened 
by  an  external  request,  that  file  is  frozen  as  far  as  allocation 
is  concerned  and  remains  resident  (for  example  in  the  DBM  where 
coordinator  requests  are  concerned)  until  closed.  The  coordinator 
would  have  the  capability  of  initiating  a transfer  from  DBM  to  EM 
very  similar  to  a DMA  (Direct  Memory  Access).  Such  a transfer 
identifies  the  name  of  the  file  in  the  DBM  and  the  length  and 
physical  location  of  the  EM  area  reserved  for  the  transfer. 
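
The open, freeze, and transfer behavior described above can be sketched as follows. Everything named here (`DbmController`, `transfer_to_em`, the use of a `bytearray` to stand in for an EM area) is an assumption made for illustration; the sketch only shows that an open file cannot be evicted and that a transfer is requested by file name plus an EM address and length.

```python
class DbmController:
    """Toy model of name-based access to files held in the DBM."""

    def __init__(self):
        self._files = {}    # name -> contents
        self._open = set()  # names currently frozen by an open

    def store(self, name, data):
        self._files[name] = bytes(data)

    def open(self, name):
        if name not in self._files:
            raise FileNotFoundError(name)
        self._open.add(name)

    def close(self, name):
        self._open.discard(name)

    def evict(self, name):
        # An open file stays resident until it is closed.
        if name in self._open:
            raise PermissionError(f"{name} is open; its allocation is frozen")
        del self._files[name]

    def transfer_to_em(self, name, em, em_address, length):
        """Copy up to `length` bytes of an open file into the EM area."""
        if name not in self._open:
            raise PermissionError(f"{name} must be opened first")
        if em_address + length > len(em):
            raise ValueError("EM area too small for the transfer")
        data = self._files[name][:length]
        em[em_address:em_address + len(data)] = data
        return len(data)
```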

Operation over this interface can be summarized as follows. When
an FMP task has been requested (in the Support Processor), the
Support Processor passes the names of the files needed to start the
task to the file system. In addition, the FMP portion of the MCP
is notified of the expected arrivals and an entry would be made on
a queue of "pending" job requests. In the meantime, the file
system would be busy transferring the requested files to the DBM.
When the job currently executing on the FMP completes and its
files are closed, the file system begins transferring those files
back to the bulk storage regions. At this time, the coordinator,
under control of the pending task list, takes those steps
needed to initiate execution of the next task for which all
required files are resident in the DBM. This task scheduling
requires that the status of the file loading into the DBM be
available to the coordinator. To begin the startup of a job, the
coordinator would then open the program code file and request that
it be transferred to some specific area of the EM. Other files
used for standard system monitoring would be opened at the same
time. The FMP task would begin execution after the coordinator
completed broadcast of the code files to the processors.
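
The selection rule in this paragraph (start the next pending task whose files are all resident in the DBM) can be sketched as below. The task representation (a dict with "name" and "files") and the function name `next_ready_task` are assumptions for illustration only.

```python
from collections import deque


def next_ready_task(pending, resident_files):
    """Remove and return the first pending task whose files are all
    resident in the DBM, or None if no task is ready yet."""
    for task in list(pending):
        if set(task["files"]) <= resident_files:
            pending.remove(task)
            return task
    return None
```

For example, with task A waiting on two files and task B waiting on one, B is chosen as soon as its single file arrives, even though A was queued first.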

Not all files would wait to the end of an FMP run to be unloaded
from the FMP. The Support Processor would be able to specify the
destination of expected DBM output files prior to completion of
FMP task execution. The file system would then provide automatic
staging out of the DBM once the file of interest is closed. More
discussion related to this area can be found in Section 5.9 (DBM
Controller).

4.3.6 Job Structure

A job is the only unit of work in the NASF. The job is itself a
very simple program which invokes and determines the relative
sequence of a set of programs. These programs constitute a set of
logically related tasks which perform some data transformation on
files. A job is written in FMP Work Flow Language (WFL) and it
runs on the Support Processor (B7800). FMP WFL contains B7800
standard WFL as a proper subset, so any existing B7800 (or B7700)
job can run unmodified on the NASF. The WFL commands are either
simple action commands (RUN, COMPILE) or tests of conditions (IF
SUCCESSFUL-COMPILE THEN ...).

4.3.6.1 Organization of a Job

The basic outline of a typical job is constrained by the
computational envelope and LINKER concepts (see Section 4.3.7). The
typical job will contain, in sequence:

1. None, one or more FMP FORTRAN compilations
2. If there is a compilation, a LINKER task
3. Specification of necessary input files for FMP program
4. One or more executions of FMP programs

In  addition,  any  number  of  B7800  tasks  may  be  interspersed  with 
the  above,  such  as  to  generate  input  files,  or  to  process  output 
files. 


4.3.6.2 Flow of Job


Figure 4.13 shows a general view of the flow of a job in the NASF.
A job enters at the upper left (BOJ = Beginning of Job). First the
job itself, the Work Flow Language, must be analyzed, so the job is
scheduled and finally analyzed on the CPU. The result is a
JOBFILE which controls the sequencing of the rest of the tasks in
the job. If FORTRAN compilations and LINKER tasks are requested,
control remains on the left of the figure. When an FMP task is
specified, that request together with the identification of any
files needed is passed to the File System (upper right of figure).
Once all the files have been staged into the DBM, the task is
placed READY for FMP execution (lower right of figure). Once the FMP
task is complete, the Support Processor is notified so that the
next task specified in the Work Flow can be initiated. When all
tasks are complete, the job terminates (EOJ = End of Job, at lower
left of figure).

Figure 4.13 NASF Job Flow Diagram
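
The job flow just described can be traced as an ordered list of events. This is a rough sketch, not the actual JOBFILE mechanism: the function `job_flow`, the task tuples, and the event strings are all hypothetical.

```python
def job_flow(tasks):
    """Return the ordered events for a job.

    tasks is a list of (name, kind) pairs, where kind is "FMP" for a
    task executed on the FMP and anything else for a task that stays
    on the Support Processor (compilations, LINKER, B7800 tasks).
    """
    events = ["BOJ", "analyze WFL -> JOBFILE"]
    for name, kind in tasks:
        if kind == "FMP":
            events.append(f"{name}: stage files into DBM")
            events.append(f"{name}: READY")
            events.append(f"{name}: execute on FMP")
            events.append(f"{name}: notify Support Processor")
        else:
            events.append(f"{name}: run on Support Processor")
    events.append("EOJ")
    return events
```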


4.3.7 Program Load and Overlay Support

The FMP evaluated would run only one program at a time. No
additional program or data area may be preloaded into the EM or
processor memories. Although preloading might minimize setup delays
when starting the next task, additional hardware would be required
to support the desired level of security. The Data Base Memory
and its controller allow preloading of programs and data.
Security can be better maintained at this level since all references
to data in the DBM are by descriptor (or name).

The LINKER accepts object code files from one or more separate
FORTRAN compilations and produces a single load code file, called
the loadfile. In the process, the LINKER assigns memory locations
to all program instructions and resolves or relocates address
references accordingly.

For the case that the program memory part of the user program is
too large (i.e. would not fit within the processor memory), the
LINKER supports an overlay facility. With this mechanism, the
user may divide a program into multiple phases and then may
specify which phases share the same memory locations.
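
The effect of letting phases share memory locations can be illustrated with a small layout routine. This is a sketch under stated assumptions: the grouping of phases into overlay groups, the function name `overlay_layout`, and the sizes used in the example are hypothetical, and the real LINKER may lay out memory differently.

```python
def overlay_layout(groups):
    """Assign base addresses to phases.

    groups is a list of overlay groups; each group is a list of
    (phase_name, size) pairs whose phases share the same locations.
    Returns (base_addresses, total_size).
    """
    base = {}
    address = 0
    for group in groups:
        for name, _size in group:
            base[name] = address  # every phase in the group starts here
        # The shared region must hold the largest phase in the group.
        address += max(size for _name, size in group)
    return base, address
```

With a resident part of 100 words and two overlaid phases of 300 and 250 words, the program needs 400 words instead of 650.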

For  the  case  that  the  data  part  of  the  user  program  is  too  large, 
the  user  may  use  the  direct  I/O  facilities  to  and  from  files  in 
the  DBM.  Automatic  virtual  memory  mechanisms  were  not  suggested 
for  this  system  since  the  applications  considered  during  the  study 
did not require such mechanisms. If a significantly different
workload and application for the system is expected (than the
applications studied), the cost-benefit tradeoffs should be
reevaluated.

Data is either initialized, uninitialized, or initialized to
"invalid". Initialized segments have their initial contents
present in DBM as generated by the Compiler/Linker. Uninitialized
segments and segments to be initialized to "invalid" are not
present on the DBM. In this case, storage is initialized by the
execution of appropriate FMP code.

4.3.8 Operations Support

4.3.8.1 Performance Monitoring

Certain information will be monitored during NASF operations,
collected, and reported as part of the system log. Some of this
information is accumulated by the B7800 as part of normal
monitoring in the existing MCP. Other information would be collected
by the FMP portion of the MCP. Some of the information that may be
included in such monitoring is:

1. Interval timer reading at the time of the report

2. Real time clock at time of the report

3. Count of CN-using instructions

4. Some measure (to be determined) of the processor idle
time

5. A measure of the time that the coordinator only is busy
(i.e. all processors idle)

6. Count of successful error corrections

7. For each error correction, the address and the observed
pattern

8. Time spent in specific subroutines

9. Others to be determined

The interval timer in the coordinator would be coordinated with
the Support Processor at the beginning of a run.

Other monitoring in the MCP would be task related. Beginning-of-Task
and End-of-Task of FMP tasks, OPEN and CLOSE of DBM files,
and  traffic  to  and  from  the  DBM  would  be  logged.  Operator  console 
system  status  display  would  be  extended  to  include  FMP  tasks. 

4.3.8.2 System Initialization

FMP initialization is that process whereby the FMP is transformed
from  an  indefinite  (i.e.  any  arbitrary)  state  into  a state  in 
which  it  normally  processes  user  programs.  This  process  reinitial- 
izes all  parts  of  the  system.  Conceptually,  the  initialization 
process  corresponds  to  a coldstart  where  not  only  is  the  MCP 
loaded,  but  all  tables,  directories,  etc.  are  initialized. 

Initially, no process corresponding to a "coolstart" (where the
disk directory is saved) or to a "halt load" (where jobs are
restarted from the last inactive point) will be implemented.
Restart in the face of failures needs to be carefully studied since
there seem to be a number of natural points at which execution
could resume after a failure without having to reinitialize. In
particular, while executing all the instances of a DOALL, if one
processor failed, only those instances assigned to that processor
would have to be recomputed in the spare processor. Since the
ENDDO would have occurred without successful completion of all
instances, the old values from the start of the DOALL would still
be available. Careful analysis of this sort of circumstance may
show other "natural" retry points in the system.
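
The DOALL retry idea can be sketched in a few lines. This is an illustration only: the function `run_doall`, the mapping of instances to processors, and the use of a Python callable as the loop body are assumptions, and the instances are simulated serially rather than in parallel.

```python
def run_doall(instances, assignment, body, old_values, failed=frozenset()):
    """Run every DOALL instance, recomputing only those instances that
    were assigned to a failed processor.

    assignment maps instance index -> processor id.
    Returns (results, retried_instances).
    """
    results = {}
    retried = []
    for i in instances:
        if assignment[i] in failed:
            retried.append(i)  # lost with the failed processor
        else:
            results[i] = body(old_values[i])
    # The values from the start of the DOALL are still intact, so the
    # lost instances can simply be recomputed (on the spare processor).
    for i in retried:
        results[i] = body(old_values[i])
    return results, retried
```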

Initialization of the FMP itself consists of the following steps:

1. The driver program (executing on the Support Processor)
determines that the B7800-FMP connection is operational.
This connection is a low bandwidth connection via the
Diagnostic Controller (DC) part of the FMP and the Data
Comm Controller on the B7800.

2. The driver transfers the FMP portion of the MCP to the
coordinator via the DC. The coordinator then begins
execution of its part of the MCP.

3. An initialization phase of the FMP MCP will perform
various initialization functions, including confidence
tests.

4. The MCP will then complete its initialization and inform
the Support Processor.

The FMP is then ready to process programs.
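
The four initialization steps can be sketched as a driver routine that stops at the first failure. The function `initialize_fmp`, its callback arguments, and the log strings are hypothetical and serve only to show the ordering of the steps.

```python
def initialize_fmp(link_ok, confidence_tests):
    """Walk through the initialization steps and return a log of them.

    link_ok is a callable that reports whether the B7800-FMP connection
    is operational; confidence_tests is a list of (name, callable).
    """
    log = []
    # Step 1: verify the low bandwidth connection through the DC.
    if not link_ok():
        raise RuntimeError("B7800-FMP connection not operational")
    log.append("link verified")
    # Step 2: transfer the FMP portion of the MCP; coordinator starts it.
    log.append("MCP transferred to coordinator via DC")
    log.append("coordinator executing MCP")
    # Step 3: initialization phase, including confidence tests.
    for name, test in confidence_tests:
        if not test():
            raise RuntimeError(f"confidence test failed: {name}")
        log.append(f"confidence test passed: {name}")
    # Step 4: report completion to the Support Processor.
    log.append("Support Processor informed; FMP ready")
    return log
```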

4.4 OTHER SOFTWARE REQUIREMENTS

Although the FMP FORTRAN language and compiler, and the NASF
Master Control Program (MCP) are the key elements of the NASF, a
number of other software capabilities and requirements exist.
These capabilities and requirements might be classified as those
which are supportive to the language and MCP developments and as
those which may provide more general utility of the system.

To support the language, software development cannot stop with the
compiler (both a prototype version and a more final version). In
addition, a system development language must be identified to
support the development of the operational environment.
Input-Output Formatting routines would need to be developed, especially
if a final review of the impact of various system scenarios shows
that the Support Processor would then be the appropriate system
resource to provide all I/O support. The program library and
overlay facilities that may be desirable would be supported by a
LINKER or BINDER.

Those jobs in execution on the NASF will need to be able to
utilize various intrinsics, some of which will be resident on the FMP.
These intrinsics would include FMP task initialization (including
EM and PM loading), run-time execution monitoring, and
mathematical intrinsics.

Some of the simulation support that would be needed in the
development of the NASF could be based on work done as part of this
study. Simulators at various levels would be utilized, including:

NASF block-level simulation

FMP simulation for timing estimates

Functional simulators for early code development support

Another important area of software would be the systems developed
to support the diagnostics and maintenance of the NASF (which are
discussed in more detail in Chapter 6). These software tools
would include:

Off-line FMP diagnostics which would be initiated by the
Support Processor and exercise the FMP when no jobs were
active.

On-line processor diagnostics to be used both as part of the
off-line FMP diagnostics above and as a means of testing the
spare processors when not actively assigned to user problems.

Automatically managed FMP confidence tests

Diagnostic generation tools to be available both during
development and initial test of the system, and also as a tool
to allow the Field Engineer to produce new tests as required.

All  standard  diagnostics  and  maintenance  tools  provided  as 
part of any standard equipment included in the NASF.

Tester  Software 

In addition to the above capabilities, most of which must be
developed specifically for the NASF, software already exists for
that portion of the system which may be implemented with standard
products. For the B7800 Support Processor, a complete set of
languages, utilities, and application packages exists, including:

ALGOL 
PL/1
FORTRAN 
COBOL 

BINDER  (linker) 

CANDE  (a  text  editor) 

WORK  FLOW  MANAGEMENT  (operating  system) 

NETWORK  DEFINITION  LANGUAGE  (for  communications  control) 

4.5  CONCLUSIONS 

The implementation of a system such as the NASF is a major
undertaking. However, the software portion of the system studied is a
realistic task to approach since it can be based in large part on
existing software. The major part of the operating system exists,
including  the  techniques  to  control  an  "attached  processor"  with  a 
computational  envelope  supporting  one  user  at  a time. 

The language extensions would be straightforward to implement.
Since  the  extensions  are  strongly  biased  to  description  of  the 
problems  rather  than  explicit  mapping  to  the  hardware  and  since 
the  architecture  reflects  the  structure  of  the  problems,  the  nec- 
essary flexibility  exists  to  allow  growth  and  improved  efficiency 
over  the  future  of  the  NASF. 
CHAPTER 5


FLOW MODEL PROCESSOR (FMP) HARDWARE


5.1 INTRODUCTION

This chapter contains the results of the past year's study with
respect to the design of the Flow Model Processor (FMP) hardware.

In significant areas, the FMP design presented here is
substantially more flexible and more general purpose than the FMP design
of Ref. 1. Whereas that FMP was tailored to be efficient on
programs that could be vectorized, with some extension to the case
where the data did not form vectors, the current FMP performs
essentially just as efficiently whether the data can be arranged
in the form of vectors or not. In the present FMP, the 512
processors can work together efficiently as a vector machine; they
can be just as efficient when working as 512 independent scalar
processors.

The FMP is capable of execution in a manner similar to lock-step
array machines such as ILLIAC IV or the Burroughs Scientific
Processor (BSP). Simple programs (a copy resident in each
processor), with no data-dependent branching, will produce this
result. The FMP is not limited to this mode of execution, however.
It is also capable of performing in the manner of conventional
multiprocessors. Interprocessor synchronization is implemented
via special commands and use of the shared memory (Extended
Memory).

It is expected that the multiprocessor capabilities of the FMP
would be used on array-oriented problems. In particular, all
processors are cooperating on the same job, with each processor
independently executing some small portion of the job. In this
mode of execution it becomes important to have as small a time
penalty as possible when synchronization of the processors is
required. The coordinator gives the FMP the ability to do
array-wide synchronizations in one instruction.

The result is an architecture that is much more flexible than the
current generation of high-performance processors, in that there
is no requirement to vectorize the algorithm. It is also easy to
put a great many processors to work on a single algorithm because
of the degree of interprocessor cooperation available through the
coordinator and the common Extended Memory. Although the aero flow
codes are dominated by vectorizable algorithms, there are
portions, such as subroutine CHARAC in the explicit aero code, where
the data dependency is different from processor to processor, and
the independent execution of each processor simplifies matters
greatly. The radiation and physics computations of the weather
codes use the independence of the processors to an even greater
extent.

Some of the more important design considerations are discussed in
the following subsection. The sections following in this chapter
review the FMP architecture, briefly list the system parameters
and describe each of the major elements of the FMP in turn.


5.1.1 Design Constraints and Considerations

During the course of a major hardware development project, such as
the FMP, consideration of and compromise between many (sometimes
conflicting)  requirements  must  be  made.  Some  of  the  important 
considerations  on  this  project  (throughput,  economy, 
hardware/software  compatibility,  and  schedule)  are  discussed 
briefly  below. 


5.1.1.1 Throughput

One  major  compromise  in  the  design  of  any  processor  is  between 
processor  performance  and  its  cost.  In  this  project,  the  point  of 
maximum performance per unit of cost is identified on the cost vs.
performance curve for a single processor. Enough of those
processors are built to deliver the required throughput. This
approach contributes to maximizing performance vs. cost for the FMP
as  a whole.  The  above  evaluations  result  in  the  choice  of 
high-speed  ECL  and  implementation  on  large  boards. 
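
The sizing approach in this paragraph can be illustrated numerically. The following is a sketch only: the function `choose_processor` and the design points in the usage note are invented for illustration and are not data from the study.

```python
import math


def choose_processor(design_points, required_throughput):
    """Pick the design point with the best performance per unit cost,
    then count how many such processors meet the required throughput.

    design_points is a list of (name, performance, cost) tuples.
    Returns (name, processor_count, total_cost).
    """
    name, perf, cost = max(design_points, key=lambda d: d[1] / d[2])
    count = math.ceil(required_throughput / perf)
    return name, count, count * cost
```

With hypothetical points ("slow", 1.0, 1.0), ("mid", 3.0, 2.0), and ("fast", 4.0, 4.0) and a required throughput of 10, the "mid" point has the best ratio and four processors are needed.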

5.1.1.2 Economy

Although those sections of programs which are vectorizable can be
conceptually implemented on a processor that enforces lock-step
cooperation among all the processors, the hardware required to
enforce such lock-step operation is almost entirely absent from the
FMP. Each processor is self-contained, with as rudimentary a
connection to the rest of the machine as the problem requirements will
allow. The MIMD* construction of the machine also simplifies the
software, both system software and application program
development.

5.1.1.3 Hardware/Software Compatibility

The  overall  economy  of  a system  is  directly  affected  by  the 
hardware  support  of  software  requirements.  In  some  cases  specific 
hardware  features  may  be  required  to  reduce  software  costs.  On 
the  other  hand,  when  hardware  features  are  not  required,  system 
costs could be reduced by not providing these features. Some
specific considerations on this project include:

(1)  The  FMP  has  only  one  user  program  resident  on  it  at  any 
one  time. 

(2) Data addresses are independent of code locations. Some
degree of dynamic run-time data allocation is done. For
example, space local to a subprogram is allocated upon entry
to that subprogram, and released upon exit, using a stack
mechanism for allocating space. Space is allocated to a
named common only upon entry to the first program unit
naming that common, and is deallocated upon exit from the
last. Integer registers are used as stack pointer, and as
pointers to named common areas. Many machines of the older
generation allocate space permanently, even during those
periods that the FORTRAN 77 specification declares them to
be undefined. In the present case, that would reduce the
size of the problem that can be handled. For example, in
the implicit aero flow code, BTRID is a large named common in
subroutine BTRI, and subroutine SMOOTH has arrays SS and CT.
These do not exist concurrently, so processor memory can be
devoted to BTRID during the execution of BTRI and to SS and
CT during SMOOTH. If space had to be allocated for both of
these all the time, the largest allowable BTRID would be
substantially smaller.

(3) Automatic stack pushing and popping on subroutine entry and exit.

(4) A full set of interrupts both at the processor level and
the coordinator level.

(5) Requests to the Data Base Memory controller, for data in
Data Base Memory, carry the name of the file involved, not
its address.

*Multiple Instruction Stream, Multiple Data Stream
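
The memory saving from dynamic allocation in item (2) can be made concrete with a small calculation. The sizes below are invented for illustration; only the names BTRID, SS, and CT come from the text, and `peak_memory` is a hypothetical helper.

```python
def peak_memory(phases, sizes, dynamic=True):
    """Return the processor memory needed for named data areas.

    phases is a list of sets, each holding the areas live in one phase.
    With dynamic allocation only the live areas occupy memory; with
    permanent allocation every area occupies memory all the time.
    """
    if not dynamic:
        return sum(sizes.values())
    return max(sum(sizes[a] for a in live) for live in phases)


# Assumed sizes in words, not taken from the aero flow code.
sizes = {"BTRID": 6000, "SS": 2000, "CT": 1500}
phases = [{"BTRID"}, {"SS", "CT"}]  # BTRI, then SMOOTH
```

Under these assumed sizes, dynamic allocation needs 6000 words at peak while permanent allocation needs 9500, so the same memory could hold a larger BTRID in the dynamic case.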


5.1.1.4 Schedule

Historically, every two years' worth of technological development
has resulted in the delivery of computers that are about three
times more powerful for the same cost. Thus, adding an
unnecessary year between the design freeze and the delivery of a computer
amounts to using technology that is one additional year toward
obsolescence, and has a penalty of a factor of 3^(1/2) in
computational horsepower. This trend has slowed recently. Even so, it
is important to use straightforward, low-risk designs to achieve
timely delivery.
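
The one-year penalty follows from the two-year factor if the growth rate is assumed constant from year to year. As a worked step, with g denoting the assumed yearly growth factor (a symbol introduced here, not in the text):

```latex
% Two years of development give a factor of three:
g^{2} = 3 \quad\Longrightarrow\quad g = 3^{1/2} \approx 1.73
% A slip of t years at the same rate costs a factor of
g^{t} = 3^{t/2}
```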

5.2  FMP  ARCHITECTURE 

Figure 5.1 shows the general organization of the FMP. The major
elements  are: 

(1) 512 Processors, each containing a scalar execution unit
and storage for data and program,

(2) Connection Network used to interconnect processors and
the Extended Memory,

(3) 521 Extended Memory modules, which hold the main data
base of the program,

(4) Data Base Memory, used as a staging area for jobs to be
scheduled and as a high-speed input/output buffer for
jobs in execution,

(5) Coordinator, used to synchronize the processors, to
interface to the Support Processor, and to run
diagnostics, and

(6) Diagnostic Controller, which allows direct control of
fault isolation in the FMP from the Support Processor.

Each processor is self-contained, with integer and floating-point
arithmetic units, its own instruction decoder, and its own program
and data memory. Four extra processors are included as on-line
spares to help achieve system availability requirements. In
addition, four extra Extended Memory modules are included as
on-line spares, again to help achieve system availability
requirements.


5.2.1  General  Flow  Through  FMP 

During  normal  operation,  all  data  and  program  for  the  next  run 
will be loaded into Data Base Memory (DBM) prior to the beginning
of the run. The DBM loading is initiated by the scheduler in the
Support Processor via the File System Controller (these NASF
system  elements  are  described  in  Chapter  7).  The  scheduler 
initiates  a run  on  the  FMP  through  interaction  with  the 
coordinator  (CR) . 

When  the  run  starts,  software  in  the  coordinator  initiates  the 
transfer  of  code  files  from  the  DBM  to  the  Extended  Memory  (EM). 
From  there  the  coordinator  causes  its  code  files  to  be  loaded  in 
its  memory  and  causes  the  Processor  code  files  to  be  broadcast  to 
each  Processor.  The  initialization  phase  of  the  program  (in  the 
coordinator)  then  transfers  necessary  data  to  EM.  These  actions 
are  automatically  inserted  by  the  compiler  and  the  linker.  With 
data  in  place  in  extended  memory,  and  allocated  space  optionally 
initialized to "invalid", and with code files in place in
coordinator  and  processors,  user  execution  starts. 

When  user  execution  is  in  progress,  the  coordinator  serves  as  a 
high-level  "instruction  sequencer".  Processor  tasks  are 
explicitly  initiated  and  when  all  processors  complete  their  tasks 
(by  indicating  "I  got  here"),  the  coordinator  causes  the  next  task 
to  be  initiated  in  its  sequence. 
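
The sequencing just described can be sketched in Python. This is only an illustration under stated assumptions: threads stand in for processors, a reusable barrier stands in for the "I got here" indication, and the names run_task_sequence, tasks, and log are hypothetical and do not come from the FMP design.

```python
import threading

def run_task_sequence(num_processors, tasks):
    """Run every task on every processor; start the next task only
    after all processors have reported completion of the current one."""
    log = []                      # (task_index, processor_id) completion records
    log_lock = threading.Lock()
    # The processors plus the coordinator all meet at the barrier.
    barrier = threading.Barrier(num_processors + 1)

    def processor(pid):
        for index, task in enumerate(tasks):
            barrier.wait()        # wait for the coordinator to initiate the task
            task(pid)
            with log_lock:
                log.append((index, pid))
            barrier.wait()        # the "I got here" indication

    workers = [threading.Thread(target=processor, args=(p,))
               for p in range(num_processors)]
    for worker in workers:
        worker.start()
    for _ in tasks:               # coordinator acting as instruction sequencer
        barrier.wait()            # initiate the task on all processors
        barrier.wait()            # wait until all processors have finished
    for worker in workers:
        worker.join()
    return log
```

Because the coordinator takes part in both barrier waits, no processor can begin task k+1 until every processor has finished task k, which is the only ordering the text requires.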

5.2.2 Changes from Baseline System

The  Baseline  System  of  the  preliminary  study  (see  Ref.  1 and  Ref. 
2)  had  the  same  basic  organization  as  the  system  shown  in  Figure 
5.1.  The  major  difference  is  in  the  type  of  connection  between 
the  processors  and  the  extended  memory  and  in  the  system 
implications  of  that  connection. 

The Baseline System proposed use of a "Transposition Network"
which allowed flexible access of vectors and array components from
the Extended Memory. The "price" of this vector-fetching
capability was that the processors had to be synchronized at each
Extended Memory fetch time (accomplished by the Control Unit).

The modifications proposed during the present feasibility study
relax the need for coordination to only the start and end of
concurrent, independent code sections. To accomplish this, an
alternative scheme to interconnect the processors and memories was
proposed, which is called the Connection Network (CN). The
reduction in synchronization requirements had the side effect of
greatly simplifying coordination tasks. These simplified tasks are
handled by a unit now called the Coordinator (CR).

Evaluation of system loading has resulted in some proposed changes
in bandwidth between FMP components. The current bandwidth plans
are summarized in Figure 5.1.

5.2.3  Basic  System  Parameters 

No  major  changes  have  been  made  since  the  preliminary  study.  The 
choice  of  these  parameters  was  covered  in  detail  in  previous 
reports  (see  Ref.  1 and  Ref.  2).  Following  is  a summary  of  the 
basic  system  parameters. 

5.2.3.1 Logic Family

ECL is expected to be the preferred logic family. If the final
design were being implemented at this time, Fairchild's 100K
series would be chosen together with compatible memory circuits.
Final selection of a logic family will be deferred to the
appropriate point in the design cycle in order to gain the most
effective, low-risk components.

Chip  counts  were  made  assuming  chips  projected  to  be  available  in 
1980.  Confidence  in  this  count  is  supported  by  the  count  in  a 
comparable  processor  which  has  been  designed  using  circuit  types 
available  in  1978.  See  Appendix  E of  Reference  1 for  preliminary 
data  on  this  processor. 


5.2.3.2 Clock Rate


The clock has been assigned a 40-nanosecond period. The instruction
times given in Appendix C are in terms of this clock period.
These times are compatible with the instruction times derived from
the processor design referred to in Appendix E of Ref. 1, using
ECL 100K.

5.2.3.3 Cabling Methods

The same flat belts used successfully in prior projects at
Burroughs for transmitting high-speed signals with fast rise time
and low crosstalk will be used for most of the inter-unit cables.
Reference 1 discusses this choice.

5.2.3.4 Power

Power and grounding design are discussed in detail later
in this chapter. The primary design considerations are:

(1) A small number of centralized power conditioning modules
that accept raw power from the mains,

(2) Switching regulators for efficiency,

(3) Defense against faults in the incoming power,

(4) Defense against faults in the FMP,

(5) Noise-reducing grounding methods, and

(6) Non-volatility of DBM contents.

5.2.3.5 Number of Processors

A key decision in the design of the FMP is the choice of the
number of processors to be implemented. The approach is first to
design the most cost-effective processor, and then to link together
a sufficient number of them to produce the required throughput
rate. This approach finds that 512 processors is the nearest round
number to match the aero flow requirements, and performance analysis
then confirms that it produces an FMP that meets the
aero flow (and weather) requirements. The processor design
selected is one that matches the 80 ns, 16K-bit by one, static RAM
chips that are forecast to be available by the time the FMP is
being designed. This is a fairly simple ECL processor, with a 40 ns
clock and a 120 ns memory cycle.

A faster processor might allow the FMP to be built with 256
processors. This requires a faster memory, and therefore is
projected to require a smaller (4K-bit), faster (30 ns) memory
chip. The result is a doubling of the number of memory parts
required. The faster processor is also estimated to require far
more logic parts, with a net increase in parts count. More parts
implies more failures, and hence a lowered reliability. Fewer
processors, however, means reduced throughput penalty for those
parts of some applications where concurrency cannot be found, and
hence some extension of the spectrum of applications.

Final decisions will be postponed to take maximum advantage of
components available at the time of design. For example, if the
16K-bit chips were faster than here forecast, a faster processor,
but only 256 of them, might be preferred. If 64K-bit chips were
available at the same speed as the 16K-bit chips here forecast,
these would be preferred to the 16K-bit chips, since one would get
twice as much memory with improved reliability due to the reduced
parts count.

In such a case, it is possible that fewer processors would be
needed to obtain the same throughput. When considering the
16-kilobit  RAM  versus  the  faster  4-kilobit  RAM,  the  4-kilobit  RAM 
chip  would  require  a 4-fold  increase  in  the  number  of  memory 
components.  In  this  case,  a trade-off  between  the  reliability 
impact  of  a larger  number  of  memory  parts  and  possible  reduced 
costs  from  a smaller  number  of  processors  seems  to  indicate  that  a 
more  reliable  system  is  the  most  cost-effective.  It  takes  512 
processors,  at  120  ns  memory  cycle  (projected  for  80  ns  chips)  and 
40  ns  logic  clock,  to  yield  the  desired  throughput  of  one  billion 
floating  point  operations  per  second. 
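
The throughput figures above imply a per-processor budget that can be worked out directly. The short Python sketch below does so, assuming for illustration that the one-billion-operations-per-second target is spread evenly over all processors and ignoring synchronization and memory-access overheads; the function and constant names are hypothetical.

```python
CLOCK_PERIOD_NS = 40        # logic clock period given in the text
MEMORY_CYCLE_NS = 120       # memory cycle given in the text
TARGET_FLOPS = 1.0e9        # one billion floating point operations per second

def per_processor_rate(target_flops, num_processors):
    """Sustained floating point rate each processor must deliver."""
    return target_flops / num_processors

def periods_per_operation(rate_flops, period_ns):
    """Average number of periods of length period_ns available per operation."""
    periods_per_second = 1.0e9 / period_ns
    return periods_per_second / rate_flops

rate_512 = per_processor_rate(TARGET_FLOPS, 512)   # 1,953,125 operations/s
rate_256 = per_processor_rate(TARGET_FLOPS, 256)   # twice the rate per processor
clocks_512 = periods_per_operation(rate_512, CLOCK_PERIOD_NS)
memory_cycles_512 = periods_per_operation(rate_512, MEMORY_CYCLE_NS)
```

With 512 processors each one must sustain just under two million floating point operations per second, which leaves on average 12.8 clock periods, or a little over four memory cycles, per operation. Halving the processor count to 256 halves both budgets, which is why that option calls for a faster processor and memory.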


5.2.4  Modularity 

Although the NASF requirements did not specifically address the
problems of system modularity, the FMP design described below
contains a very small number of standard modules. These modules
are the Extended Memory module, the Connection Network switch
module, and the Processor Module. The Processor Module, in turn,
consists of an Execution Unit Module and a Processor Storage
Module. There is also a Data Base Memory Storage Module.

This modularity allows the potential of configuring smaller (or
larger) systems out of the same parts, with no impact on a user's
perception of the system. In addition, such modularity greatly
reduces the magnitude of the design task for a system of the
required capabilities and should reduce the fabrication costs,
since there will be many copies of a small number of parts built.

5.2.5  Preview  of  FMP  Component  Descriptions 

Following  is  a brief  description  of  each  of  the  elements  of  the 
FMP  together  with  a formatted  tabulation  of  pertinent  features  and 
a block  diagram  of  each. 

For each element of the FMP, there is a table of characteristics
given. A very short narrative description gives the intended
function of the element in user programs. Source of control is
identified, and the storage capabilities, both capacity and speed,
are also given. Connectivity to other elements is defined in
detail.

The table also discusses the modes of error control built into the
design. Most of these mechanisms are discussed in more detail in
Reference 1 and Reference 2. The chip count is that projected for
a 1980 design. "TBD" means "to be determined".

5.3 PROCESSOR

The array of 512 processors is charged with the task of executing
the user computations in the program, namely the floating-point
operations on the problem variables.

The processor executes code contained in its own program memory,
and accepts commands from the coordinator. Certain instructions
are executed in synchronism with the coordinator (and hence, by
implication, in synchronism with the entire array, since the
coordinator expects cooperation from all processors).

The actions of the processor are delineated by the instruction set
detailed in Appendix C. Figure 5.2 shows the division of the
processor into an Execution Unit (EU), a Processor Memory (PM),
and a CN Buffer (CNB). Table 5.1 provides data on the
characteristics of the processor as a whole.


5.3.1  Execution  Unit  (EU) 


Figure  5.3  is  a block  diagram  of  the  Execution  Unit  (the  logic 
part  of  the  processor)  and  the  CN  Buffer,  showing  the  independent 
integer  and  floating  point  units,  with  separate  register  files  for 
each.  Figure  5.4  is  a diagram  of  the  instruction  fetching  and 
overlap  machinery.  Table  5.2  provides  data  on  the  Execution  Unit. 
Connections  to  the  processor  come  from  the  control  unit  and  the 
Connection Network. The synchronization signals, the 4-bit-wide
command path, and its strobe come from the coordinator. The
data  paths  to  and  from  the  connection  network  are  each  accompanied 
by  a strobe.  In  addition,  each  processor  is  connected  to 
backplane  wiring  that  expresses  its  own  number. 

Of the 129 processors in a cabinet, any one may be the spare
processor. Suppose processor No. N is the spare processor. Then
the backplane number for processors 0 through N-1 is correct but
the backplane number for processors N+1 through 128 must be
shifted down by one, to N through 127, in order that the processors
being used by the program be consecutively numbered. Therefore,
there is a 1-bit signal coming from the spares-designating
machinery which tells the processor whether or not to subtract 1
from its hard-wired processor number to correct for the location of
the spare. Two bits of processor number are the cabinet number, and
do not enter into the subtraction.
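
The renumbering rule can be expressed as a short Python sketch. The function names are hypothetical, and the way the two cabinet bits are combined with the corrected in-cabinet number (as the high-order bits) is an assumption made for illustration; the text states only that the cabinet bits do not enter into the subtraction.

```python
PROCESSORS_PER_CABINET = 128    # working processors per cabinet (129 slots, one spare)

def subtract_bit(slot, spare_slot):
    """The 1-bit signal: 1 if this slot lies above the spare and must shift down."""
    return 1 if slot > spare_slot else 0

def logical_number(cabinet, slot, spare_slot):
    """Program-visible processor number, or None for the spare itself."""
    if slot == spare_slot:
        return None
    local = slot - subtract_bit(slot, spare_slot)
    # Assumption: the cabinet number forms the high-order bits of the result.
    return cabinet * PROCESSORS_PER_CABINET + local
```

For example, with the spare in slot 5 of a cabinet, slots 0 through 4 keep their numbers and slots 6 through 128 become 5 through 127, so the 128 working processors are numbered consecutively.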
Table 5-1. Processor Characteristics

Number in System: 512 (No. of on-line spares: 4)

Function

To execute code written by the FMP FORTRAN compiler, with an
upper limit on speed of over three million floating point
operations per second. The code is executed cooperatively
with other processors and with the coordinator.

Mode of Operation

Execution of instructions fetched from the processor's own memory;
execution of commands issued by the coordinator (diagnostics
only); interaction with EM via the CN buffer.

Storage Capacities

32,768 words
120 ns cycle (odd-even interlace)
static RAM technology

Connectivities

To/From     Function or Name            No. Signals   Timing
CN          Addresses and data to EM;   24            20 ns per 11-bit frame;
            data from coord. and EM                   1st frame timed with
                                                      120 ns CN clock
CR          Commands plus strobe        5             Synch. with 40 ns clock
CR          Status bits to coord.       4             Change on any 40 ns clock
CR          "go" from coord.            1             40 ns pulse
Backplane   Processor number            10            Wired-in levels
Fanout      Spare bit and spare         2             D.C. level
            designator
Fanout      Clocks                      2             40 ns clock pulse; enable
                                                      for selecting every 3rd
                                                      one for CN clock

Reliability/Repairability/Trustworthiness

SECDED checker on data bus
Numerous error checks leading to error interrupts
Parity on microprogram memory

For operation in the presence of failures, spare processors
can be switched in, or SECDED can be used to cover up
failures in PM or EM.

Physical

Projected chip count:     240
Size:                     1.2" x 11.5" x 27.5" (narrow edge to backplane)
Power:                    325 watts (including 100 W losses in the
                          switching regulator)
Additional Constraints:   Includes own self-contained switching
                          regulator


[Figure 5.4 Instruction Fetching and Overlap Diagram. Blocks shown:
staging register, decode, holding register (for delayed issue),
integer unit and floating point unit instruction registers, and
memory controls.]




Table 5-2. Execution Unit (portion of processor) Characteristics

Number in System: 1 per processor

Function

Executes instructions and coordinator commands, accesses
processor memory, and interfaces with CN buffer.

Mode of Operation

Clocked at 40 ns clock, which is synchronous throughout entire
system.

Storage Capacities

32 words in addressable registers, a few additional registers also
40 ns cycle
ECL technology

Connectivities

To/From     Function or Name           No. Signals   Timing    Comments
PM          Data (bidirectional)                     Clocked
PM          Address and command                      Clocked
CN buffer   Data (both directions)                   Clocked
CN buffer   Address path                             Clocked   11 bits EM no.;
                                                               23 bits address
CN buffer   Controls
Fanout      Synchronization & status
Fanout      Commands from coord.
Backplane   Processor number
Fanout      Spares designator and
            sparebit
Fanout      Clocks

Reliability/Repairability/Trustworthiness

Contains SECDED checker, microprogram parity, etc.,
mentioned under processor.

Failed EU spared out by sparing out entire processor.

Physical

Projected chip count:   100
Size:                   About 11" x 10" within processor
Power:                  125 watts


Error control within the processor includes SECDED on data bus
transfers, parity on words in microprogram memory, and the
assortment of error and bounds checks as listed in the description
of the interrupt register.

5.3.2  Processor  Memory  (PM) 

The Processor Memory (PM) contains data and program within each
processor. Control is from the memory address register in the
processor. There are 32,768 words of 55 bits each, consisting of
48 bits of data and 7 bits of single-error-correcting,
double-error-detecting code. Data, address, and control connections
are solely to the processor. 16k-bit static RAM chips are used.
Table 5.3 describes major characteristics of the PM.
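
The word format and memory size above can be checked with a little arithmetic. The Python sketch below assumes, for illustration only, that the SECDED code is a Hamming single-error-correcting code extended by one overall parity bit (the text does not name the code), and that memory chips are counted simply as total bits divided by bits per chip. The function names are hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming SEC code plus one overall parity bit (assumed code)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

def memory_chip_count(words, bits_per_word, bits_per_chip):
    """Number of one-bit-wide RAM chips needed, rounded up."""
    total_bits = words * bits_per_word
    return -(-total_bits // bits_per_chip)   # ceiling division

check_bits = secded_check_bits(48)                      # 7
chips_16k = memory_chip_count(32768, 55, 16 * 1024)     # 110
chips_4k = memory_chip_count(32768, 55, 4 * 1024)       # 440
```

Under these assumptions 48 data bits need 7 check bits, giving the 55-bit word, and a 32,768-word memory needs 110 of the 16k-bit chips. The same memory built from 4k-bit chips would need 440, the 4-fold increase noted in section 5.2.3.5.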

5.3.3  Connection  Network  Buffer  (CN  Buffer) 


The  CN  Buffer  accepts  address,  data,  and  commands  from  the  EU,  and 
in  response  to  those  commands,  may  transmit  requests  for  either 
store  or  fetch  to  a named  EM  module,  may  accept  data  from  the  CN 
and  may  transmit  data  to  the  CN.  The  CN  Buffer  accepts  commands 
from the EU only. The "strobe" or "acknowledge" received from the
EM module via the CN is used as an indication of the success of EM
requests.

Transmissions  of  data  through  the  CN  are  synchronized  with  the  CN 
clock,  a submultiple  of  the  processor  clock.  All  CN  buffers  are 
synchronized  to  the  same  CN  clock  to  eliminate  time  races  in  the 
CN. 

Table 5.4 summarizes the characteristics of the CN Buffer. Figure
5.5 shows the states taken by the CN Buffer controls. The arcs in
the graph of this figure are labelled with the events that cause
change in state. For explanations of mnemonics, see the instruction
set in Appendix C. All eight states in the top of the
diagram are seen as "busy" by the EU. A four flip-flop internal
state register is assumed. The six command lines from the EU
carry different commands plus "go." Three of the requests
(STOREM, LOADEM, and LOCKEM) result in codes being appended to
addresses sent to the EM. In both cases where "go" is shown as
triggering the change of state, an alternative would be for the
"acknowledge" signal, on the 12th line on the data receiving side
of the CN connection, to serve instead.

The  12  lines  going  from  CN  Buffer  out  to  CN  are  11  data  lines  plus 
a strobe  that  states  the  data  is  valid.  The  12  lines  coming  from 
CN  to  CN  Buffer  are  11  data  lines  plus  "acknowledge."  Each  11-bit 
piece  of  data  is  called  a "frame".  Acknowledge  is  transmitted  by 
an  EM  module  upon  successfully  receiving  a request  through  the  CN, 
and  stays  up  as  long  as  the  connection  is  to  be  maintained.  The 
CN  uses  the  acknowledge  to  latch  up  the  chosen  path,  so  the 
acknowledge  is  a logic  level  that  stays  up  during  the  duration  of 
the  single  operation. 
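
The framing described above can be illustrated with a short Python sketch. It assumes that a full 55-bit PM word (48 data bits plus 7 SECDED bits) is sent through the CN as consecutive 11-bit frames at 20 ns per frame, and that the most significant frame goes first; the frame order and the function names are assumptions, not taken from the design.

```python
FRAME_BITS = 11         # data lines per direction, from the text
FRAME_TIME_NS = 20      # time per frame, from Table 5-1
WORD_BITS = 55          # 48 data bits + 7 SECDED bits

def frame_count(word_bits=WORD_BITS):
    """Number of frames needed to carry a word, rounded up."""
    return -(-word_bits // FRAME_BITS)

def to_frames(word, word_bits=WORD_BITS):
    """Split a word into 11-bit frames, most significant frame first (assumed order)."""
    mask = (1 << FRAME_BITS) - 1
    count = frame_count(word_bits)
    return [(word >> (FRAME_BITS * i)) & mask for i in reversed(range(count))]

def from_frames(frames):
    """Reassemble a word from frames produced by to_frames."""
    word = 0
    for frame in frames:
        word = (word << FRAME_BITS) | frame
    return word

def transfer_time_ns(word_bits=WORD_BITS):
    """Time to pass one word through the CN once the path is established."""
    return frame_count(word_bits) * FRAME_TIME_NS
```

Under these assumptions a 55-bit word occupies five frames and takes 100 ns to transmit, which fits within one 120 ns CN clock period.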




Table 5-3. Processor Memory (PM) (part of processor) Characteristics

Number in System: 1 per processor

Function

To hold program for execution by the EU, and data to be
fetched in response to that program.

Mode of Operation

Program counter (PCR) and memory address register (MAR)
contain addresses for program and data respectively. The
16k-bit chips assumed by the implementation of choice allow
the interlace of odd and even modules.

Storage Capacities

32,768 words
120 ns cycle
NMOS static RAM technology

Connectivities

To/From   Function or Name   No. Signals   Timing    Comments
EU        address            16            Clocked
EU        data               110           Clocked
EU        command            5             Clocked

Reliability/Repairability/Trustworthiness

SECDED on all words fetched (SECDED generator/checker is in
the EU).

Detection of illegal instructions, detection of the fetching
of "uninitialized" data, detection of fetching of unnormalized
floating point words.

SECDED allows continued operation at reduced reliability in
the face of single bit failures.

Sparing is done by sparing the entire processor.

Physical

Projected chip count:   130
Size:                   11" x 10" board in processor
Power:                  100 W


Table 5-4. CN Buffer (per processor) Characteristics

Number in System: 1 per processor plus 1 in coordinator

Function

To serve as an asynchronous interface with the CN, decoupling
the program running in the PR (or the coordinator) from the
access delays of EM and the CN.

Mode of Operation

Three registers hold EM number plus operation code, EM
address within module, and one word of data. EM number
serves as a request for an EM, when transmitted through the
CN. The address register is loaded by the CR, and sent to
the EM module at the appropriate time. The data word has
bidirectional connections both to CR and CN.

Storage Capacities

1 word
40 ns cycle

Connectivities

To/From   Function or Name            No. Signals   Timing
CN        Data path (bidirectional)   24            20 ns per frame
EU        Data (bidirectional)        110           120 ns CN clock
                                                    initiations
EU        EM module no. and EM        14            40 ns clock
          command
EU        Address within module       22            40 ns clock
EU        Misc. controls              9             40 ns clock
Fanout    "busy"                      1

Reliability/Repairability/Trustworthiness

All data passing through the CN buffer is checked at
destination for proper SECDED code.

Sparing is with the processor of which the CN buffer is a
part.

Physical

Projected chip count:   30 chips
Size:                   NA
Power:                  NA


[Figure 5.5 CN Buffer State Diagram. States shown include
transmitting address and data, transmitting data, transmitting
address, receiving data, and two busy (waiting) states. Transitions
are labelled with EU commands (including STOREM, LOADEM, LOCKEM,
and EMREQ) and with ACK from EM.]


The CN Buffer also contains the capability of remapping the EM
module number of an EM module which has been spared out to a
different EM module number. There are 528 backplane slots for EM
modules in the system, since all four EM cabinets are fabricated
alike. This provides for up to seven spares. However, the
reliability analysis is based on one spare per cabinet, and only
four registers, in each CN Buffer, are planned for designating
which modules are spare. A 4-word associative memory, recognizing
any one of four 10-bit EM module numbers, and substituting spare
EM module numbers for them, is a suggested implementation.
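
The suggested 4-word associative memory behaves like a small lookup table. The Python sketch below is a software analogy only: the class name, method names, and the example module numbers are hypothetical, and a dictionary stands in for the associative hardware.

```python
class SpareMap:
    """Up to four (spared-out module -> spare module) substitutions."""

    CAPACITY = 4    # four registers per CN Buffer, from the text

    def __init__(self):
        self._entries = {}

    def designate(self, failed_module, spare_module):
        """Record that requests for failed_module go to spare_module."""
        if (failed_module not in self._entries
                and len(self._entries) >= self.CAPACITY):
            raise ValueError("all four spare registers are in use")
        self._entries[failed_module] = spare_module

    def translate(self, module):
        """Return the module number actually sent into the CN."""
        return self._entries.get(module, module)
```

A request for a module that has not been spared out passes through unchanged, so the remapping is invisible to the program. For example, after `designate(17, 512)`, `translate(17)` returns 512 while `translate(18)` still returns 18.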

5.3.4  Design  Rationale  and  Changes  from  Preliminary  Study 

Size  of  the  processor  memory  was  selected  on  the  basis  of  the 
known  requirements  of  the  implicit  3-D  codes.  In  the  preliminary 
study,  the  requirements  were  projected  to  be  16K  words  of  data  and 
8K  words  of  program.  In  this  feasibility  study,  we  have  determined 
that  it  is  less  expensive  to  use  a single  uniform  memory  with  no 
penalty  in  performance.  Therefore,  the  Processor  Memory  (PM)  now 
contains  both  program  and  data  and  is  sized  at  32K  words. 

As design progresses, it may become clear that 64 kilobit RAM
chips will have adequate speed for this application. If that is
the case and if the price is only twice the price per chip of the
16 kilobit RAM chips currently planned, then the design would be
set up to use 64 kilobit chips. In this case a 64K word PM would
result, giving benefits both in larger storage capacity and higher
reliability (fewer parts). See section 5.2.3.5 for other
discussion.

Another area of change from the Baseline System (1) was the
introduction of the Connection Network Buffer (CN Buffer) just
described. The design objective of the CN Buffer is to provide an
independent logic unit to which the CN-related operations can be
passed while the EU proper continues processing. Waiting for EM
access, or for CN connections, can be done in parallel with other
processing instead of being in series with program execution. It
is included in response to the asynchronous nature of the CN.

5.4  COORDINATOR  (CR) 

The coordinator serves two functions. The first is to serve as
the focal point for array-wide synchronizations and array-wide
cooperation. To this end, the coordinator is supplied with an
array-wide synchronization mechanism, namely the "all processors
ready", "go", "any processor enabled", "any processor in interrupt
mode", and so on, as well as an access port to the CN which, in
combination with processor cooperation, allows the passing of a
single piece of data from coordinator or from one EM module to all
processors, or from all processors, combined into a single word, to
the coordinator.

During diagnostics and initialization, the array-wide cooperation
is imposed on the processors by the coordinator, which has a set
of commands that are designed to read and write every accessible
register within the processor, and generally to exercise any
intraprocessor activity.

The  second  coordinator  function  is  to  run  system  software, 
interface  with  the  support  processor,  and  with  the  DBM  controller 
for  DBM-EM  transfers,  and  also  to  be  exercised  by  the  diagnostic 
controller.  Note  that  DBM  access  requests  from  the  coordinator 
are  in  terms  of  file  identifiers,  not  addresses. 

The host initiates transfers between file-system and DBM using the
DBM allocation map and issuing I/O commands directly to the DBM
controller. No FMP-resident routine is involved in the initiation
or completion of these transfers. The DBM controller resolves
any potential conflict between these host transfers and a
coordinator-initiated DBM-EM transfer.

Figure 5.6 shows the Coordinator's two connections to the CN. One
connection  is  a CN  Buffer  identical  to  the  CN  Buffer  of  the 
processor,  and  is  used  to  access  EM.  The  other  connection  is 
logically  a memory  port,  and  is  used  for  injecting  data  to  be 
broadcast  to  all  processors,  or  for  accepting  data  that  has  been 
harvested  in  parallel  from  all  processors. 

The  Coordinator  can  be  controlled  by  commands  from  the  host 
(Support  Processor)  computer  issued  via  the  Diagnostic  Controller. 
This  interface  is  used  to  support  the  necessary  interaction 
between  the  portions  of  the  FMP  Operating  System  resident  in  the 
Support  Processor  and  in  the  Coordinator.  In  addition,  the  Support 
Processor  can  use  this  interface  to  initiate  maintenance  support 
procedures. 

The speed of the Coordinator is set by the need to execute system
software fast enough not to hold up user programs. That is,
the time the Coordinator spends executing system software must be
substantially less than the time the processors spend processing
user code. Hand-compiled samples show that the Coordinator is almost
completely idle during execution of user code. It will be
recommended that system software be allowed to execute along with
user code, letting "all processors ready" and "processor interrupt"
pull the coordinator back to the user's code as required. It is
also recommended that software conventions allocate certain
coordinator registers for user program use only, and others for
system program use only, thereby eliminating much of the swap
time.

Figure 5.7 shows the block diagram of the Coordinator. Table 5.5
summarizes the characteristics of the Coordinator.




[Figure 5.7 Coordinator Block Diagram. Blocks shown: memory (CRM),
host/DC communications register, error detection and correction,
instruction buffer, I/O, instruction decode, integer unit logic,
integer registers, and the two CN connections (access to processors
and access to EM).]


Table 5-5. Coordinator Characteristics

Number in System: 1

Function

Serves as a focal point for the achievement of array-wide
cooperation of processors; serves as the issuing point of
array-wide diagnostics.

Runs most FMP operating system segments, including
interaction with host, logging of error events, hardware
reconfigurations.

Mode of Operation

Executes program. Interrupt mechanism allows switching back
and forth between the two modes of operation.

Storage Capacities

32 registers (possibly more)
40 ns cycle
ECL register technology

Connectivities

To/From         Function or Name          No. Signals   Timing
Host            I/O channel               TBD           TBD
DBM             Descriptor issuance,      TBD           TBD
controller      status return
EM via          Clock                     2             40 ns clock pulses;
fanout tree                                             120 ns select every
                                                        3rd as CN clock
EM              Error interrupts from EM  2
CN              Control                   24            with CN clock
CN              From CN buffer            24            20 ns per frame; starts
                                                        synch with CN clock
CN              To EM-like port           24            same, but CN clock at
                                                        this port is 60 ns off
                                                        from CN clock at
                                                        CN buffer
Proc. via       Command and strobe        5
fanout
Proc. via       Synch                     5
fanout

Reliability/Repairability/Trustworthiness

Repertoire of error and reasonableness checks leading
to error interrupt.

SECDED on data bus checks from coordinator memory, from CN
buffer, and from CN to BDCST and HVST. Available for
checking channels to/from host and DBM controller also.

Diagnostic controller has direct access to coordinator state.

Physical

Projected chip count:   2,000
Size:                   20 to 30 large p/c boards
Power:                  Not estimated

5.4.1 Execution Logic

The Coordinator has a number of semi-independent execution
stations, so that more than one instruction may be in the process
of execution at any given time, just as in the processor. The
degree to which overlap, and its additional logic, are worthwhile
is a function of the amount of system software that the
coordinator is required to execute. Using only the two
aerodynamic flow models as benchmarks tells us that no overlap is
required. Therefore the specification of a mechanism of overlap,
as seen in the instruction listings, is only tentative pending
further clarification of the computational load imposed by systems
programming. The units are:

(1)  Arithmetic  unit, 

(2) Memory,

(3)  Interface  to  Support  Processor  and  DBM  controller,  and 

(4)  CN  buffer. 

Instruction  timing  is  given  in  Appendix  D. 

5.4.2  Coordinator  Memory 


The Coordinator Memory holds both program and data for the
Coordinator. It is addressable only from the Coordinator and
sends all data into the central data bus of the Coordinator.

The Coordinator Memory is identical in electrical design and uses
the same 16k-bit RAM chips as the processor memories. The size
resulting from considerations of the flow-model matching study is
32,768 words.

Table 5.6 summarizes the characteristics of this memory. Note
that it is identical to the Processor Memories in all respects.
As with PM, where the processor has a SECDED generator-checker for
all memory words, so here the coordinator has SECDED also.


5.4.3  Design Rationale and Changes from Preliminary Study

The change from the old vector-oriented transposition network of
the preliminary study to the random access connection network of
the design currently described has released the processors from
all requirements on regularity of relationship between the data
processed by one processor and the data processed by any other.
We now truly have 512 separate scalar processors in the FMP.
Hence, all desire to have a separate, different, scalar processor
associated with the Coordinator has disappeared, and the scalar
processor in the control unit of Ref. 2 has not been carried over
into the Coordinator.
Table 5-6. Coordinator Memory (CM) Characteristics


Number  in  System:  1 

Function 

To  hold  program  for  execution  by  the  coordinator  and  data  to 
be  fetched  in  response  to  that  program. 

Mode  of  Operation 

Program counter (PCR) and memory address register (MAR)
contain addresses for program and data respectively. The 16k-bit
chips assumed by the implementation of choice allow the
interlace of odd and even modules.

Storage Capacities

32,768  words 

120  ns  cycle 

NMOS  static  RAM  technology 

Connectivities

To/From       Function or Name   Signals   Timing    Comments
Coordinator   address            16        Clocked
Coordinator   data               110       Clocked
Coordinator   command            5         Clocked

Reliability/Repairability/Trustworthiness

SECDED  on  all  words  fetched  (SECDED  generator/checker  is  in 
the  coordinator) 

Detection of illegal instructions, detection of the fetching
of "uninitialized" data, detection of fetching of unnormalized
floating point words.

SECDED  allows  continued  operation  at  reduced  reliability  in 
the  face  of  single  bit  failures. 


Physical

Projected chip count:  130
Size:                  11" x 10" board in CR
Power:                 100 W
5.5  PROCESSOR COORDINATOR INTERACTION


5.5.1  Instruction Streams

The FMP is controlled by two instruction streams, which are
created in parallel by the compiler from a single sequence of
source statements. One instruction stream is executed in
the Coordinator; the other is executed by all processors
asynchronously of each other. Some statements in the source code
result in instructions in both instruction streams. Some of these
joint instructions require that the Coordinator and the processors
synchronize themselves.

5.5.2  Synchronization

The simplest synchronization that may occur is the WAIT
instruction, in which the processor sets "I got here". The
Coordinator is, or will be, executing a SYNC instruction. The
SYNC instruction waits until "all processors ready" becomes true.
"All processors ready" is the 512-way AND of each processor's "I
got here" OR NOT "enabled". That is, it is the N-way AND of the N
enabled processors. After seeing "all processors ready", the
Coordinator issues a "go" command, received simultaneously by all
processors, which then reset their "I got here" and execute the
next instruction.

When the processor has raised its "I got here" line, but before it
has received a "go" signal, it is said to be "waiting". The "I
got here" line is dropped upon receipt of the "go" pulse.

A processor is not required to be idle while "I got here" is
set. Commands are provided to set the flag and to allow
processing to continue. However, each "I got here" is considered a
separate event, so if the processor continues execution and wishes
to signal another "I got here" event, that command must wait as
required for the flag to be cleared by a "go" command from the
Coordinator.
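
The WAIT/SYNC handshake above can be sketched in a few lines of Python. This is only an illustrative model, assuming one flag pair per processor; the names Processor, all_processors_ready, and sync are hypothetical and do not come from the FMP instruction set.

```python
from dataclasses import dataclass

@dataclass
class Processor:
    enabled: bool = True
    got_here: bool = False   # the "I got here" flag set by WAIT

def all_processors_ready(procs):
    """AND over all processors of ("I got here" OR NOT "enabled")."""
    return all(p.got_here or not p.enabled for p in procs)

def sync(procs):
    """Coordinator side of SYNC: issue "go" once every enabled processor waits.

    Returns True when "go" was issued; "go" resets every "I got here" flag.
    """
    if not all_processors_ready(procs):
        return False
    for p in procs:
        p.got_here = False
    return True

procs = [Processor() for _ in range(512)]
procs[7].enabled = False             # a disabled processor never holds up SYNC
for i, p in enumerate(procs):
    if i not in (7, 100):
        p.got_here = True            # WAIT executed everywhere except 100
first_try = sync(procs)              # processor 100 has not arrived yet
procs[100].got_here = True
second_try = sync(procs)             # now "all processors ready" is true
```

In this model a disabled processor contributes a constant "true" to the AND, which is why the text can describe the signal as the N-way AND of the N enabled processors.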

5.5.3  Interface

Table 5.7 contains a list of Processor-Coordinator Interface
signals and identifies their use.

In addition to the above synchronization, the CR also has the
power to transmit commands. The commands are carried on a
4-bit-wide bus accompanied by a strobe line. Many of these
commands are used in the diagnostic programs. Some of these
commands are conditional on the "enable" bit of the processor;
some are unconditional, independent of the enable bit. No such
command is used in user-generated FORTRAN programs after initial
program loading.
Table 5-7. Processor-Coordinator Interface


Processor                 To or From   Coordinator
                          Processor

"enabled"                 from         "any processor enabled" =
                                       512-way OR of "enabled"

"I got here"              from         "all processors ready" =
                                       512-way AND of ("I got here"
                                       OR NOT "enabled")

                          to           "Go" signal to CN buffer

"Interrupt coordinator"   from         "processor interrupt" =
                                       512-way OR of "interrupt
                                       coordinator" (a bit in the
                                       coordinator interrupt
                                       register)

"Interrupt mode"          from         "any processor in interrupt
                                       mode" = 512-way OR of
                                       "interrupt mode" (tested by
                                       PINT instruction)

"sparebit"                to           Designation of processor

"spare"                   to           number of spare processor

4-bit Command Bus         to           Synchronization and diagnostic
                                       mode command



5.5.4  Fan-Out Tree (Coordinator-to-Processors)

A series of fan-out boards is supplied to implement the
Coordinator-to-Processor Interface. Signals and clock fan out
from the Coordinator to the final 516-processor destinations.

From the processors, the signals are combined, so that, within the
Coordinator, a single result appears in response to 516 signals
emitted by the processors. For example, the "all processors
ready" signal becomes true at the clock at which the last enabled
processor emits "I got here". Another such signal is the
516-input OR of "enabled".

At the processor, some signals are wired per-processor directly to
the last level of fanout board; others are daisy-chained to eight
processors from a single signal pin on the last board. The fanout
boards are pin-limited. Simple buffers with one input pin and one
output pin per signal dominate the circuit count, so hex buffers,
easily available today, will not be improved upon by 1979-1980.

Figure 5.8 shows the Fan-out Tree. Table 5.8 summarizes the
characteristics.

5.6  EXTENDED  MEMORY  MODULE 

Extended memory (EM) is the "main" memory of the FMP, in that it
holds the data base for the program during program execution.
Temporary variables, or work space, can be held in either EM or
Processor Memory (PM), as appropriate to the problem. All I/O to
and from the FMP is to and from EM via DBM. Control of the EM
comes from two sources: the first is instructions transmitted over
the CN; the second is the DBM controller, which handles the DBM-EM
transfers.

The Extended Memory consists of 521 on-line modules and four
spare modules, which are not used by the working program. Data is
allocated to EM across the modules, with the allocation EM module
number = address modulo 521 (address is least significant portion)
and address-within-module = address/512.

This addressing mode was chosen as a result of a software
decision. Vectors are an important fetching pattern in the
planned NASF applications (i.e., one vector element to each
processor). It is therefore desirable to design the system so
that vectors of 512 elements will be in 512 separate modules,
reducing memory conflicts and allowing simultaneous access to EM
for all processors. The number 521 is chosen because it is a
prime number larger than the number of processors (512). This
combination then contributes to the above desirable properties.
For a more detailed discussion, see Ref. 1 and Ref. 2.
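
The allocation rule can be checked with a short Python sketch. It is a minimal illustration, assuming word addresses start at zero and that "address/512" means integer division; the function names are hypothetical.

```python
N_MODULES = 521      # on-line EM modules (a prime)
ROW = 512            # words sharing one address-within-module

def em_location(address):
    """Return (module number, address within module) for a word address."""
    return address % N_MODULES, address // ROW

def modules_touched(start, stride, count=512):
    """EM modules hit by `count` elements starting at `start`, step `stride`."""
    return [em_location(start + i * stride)[0] for i in range(count)]

unit = modules_touched(start=1000, stride=1)
strided = modules_touched(start=3, stride=100)
bad = modules_touched(start=0, stride=521)   # stride a multiple of 521

# Every address in a sample range maps to a distinct (module, offset) pair.
sample = range(4 * N_MODULES * ROW)
distinct_pairs = len({em_location(a) for a in sample}) == len(sample)
```

Because 521 is prime, any stride that is not a multiple of 521 places 512 consecutive vector elements in 512 different modules; only a stride that is a multiple of 521 sends every element to the same module.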
Figure 5.8  Processor Coordinator Fanout Tree Block Diagram


Table 5-8. Fanout (Coord-Processor) Characteristics


Number in System:  1

Function

Provides 512-to-1 connectivity from processors to coordinator.
Provides 1-to-516 connectivity from coordinator to processors.
Provides 1-to-129 connectivity from cabinet number to
processors within cabinet.

Modes of Operation

Passive repetition of signals. No registers or program
execution occurs within the fanout tree.

Storage Capacities

None

Connectivities

To/From   Function or Names                No. Signals        Timing    Comments
Coord.    Synch, status, and command       19                 Clocked
          and clock
          Synch, status, command,          14 per processor
          clock, and cabinet no.

Reliability/Repairability/Trustworthiness

Very low parts count makes additional reliability precautions
unnecessary.

Physical

Projected chip count:  900 (of which 832 are hex buffers of
                       one sort or another)
Size:                  36 boards, 4 cabinet boards, 8 row
                       fanout boards per cabinet
5.6.1  Basic Characteristics


Each EM module has a storage capacity of 64K words (48 bits data
plus 7 SECDED bits/word).

From each EM module we need a transfer rate and access time
consistent with the most economical implementation. An
implementation in 64K-bit dynamic RAM is chosen for availability
by 1980. The low chip count enhances reliability. A 240 ns cycle
time of the memory is projected. Each word carries a
single-error-correction-double-error-detection code which is
generated at the source (DBM, CR, or processor) and also checked
there, so that transfer paths are covered by the same error
control as the contents of EM. Figure 5.9 shows the general
organization of each EM module. Table 5.9 summarizes the EM
characteristics.
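
The figure of 7 SECDED bits for 48 data bits is what an extended Hamming code requires. The sketch below works through that arithmetic; it assumes the SECDED code is of the extended Hamming type, which the text does not state, and the function name is hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits for an extended Hamming (SECDED) code over `data_bits`."""
    r = 1
    while 2 ** r < data_bits + r + 1:   # Hamming bound for single correction
        r += 1
    return r + 1                        # one extra overall parity bit

check = secded_check_bits(48)
word_width = 48 + check
```

Six check bits are enough to locate any single error among 48 + 6 bit positions, and the seventh (overall parity) bit distinguishes single from double errors, giving the 55-bit stored word.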

5.6.2  Connection  Network  (CN)  Interface  from  Processors 

Commands accepted by the EM module come either from the CN or
from the DBM controller. From the CN, a "strobe" signals the
arrival of a request. The EM module number accompanying the strobe
is matched against the module's own number for error control
purposes. Following the acceptance of the request by the EM, an
"acknowledge" bit is raised by the EM module, which locks up the CN
path and tells the requestor (processor or coordinator) that the
request is being honored.

Following  the  strobe,  and  accompanying  the  address  field,  will  be 
any  one  of  four  different  commands,  namely: 

(1)  STOREM.  Data  will  follow  the  address;  keep  up  the 
acknowledge  until  the  last  character  of  data  has 
arrived.  The  timing  is  fixed;  the  data  item  will  be 
just  one  word  long. 

(2)  LOADEM.  Access memory at the address given, sending the
data back through the CN, meanwhile keeping the
"acknowledge" bit up until the last 11-bit frame has
been sent.

(3)  LOCKEM.  Same as LOADEM except that following the access
of data, a ONE will be written into the least
significant bit of the word.  If the bit was ZERO, the
pertinent check bits must also be complemented to keep
the SECDED code correct.  The old copy is sent back over
the CN.

(4)  FETCHEM.  Same  as  LOADEM  except  that  the  "acknowledge" 
is  dropped  as  soon  as  possible.  The  coordinator  has 
sent  this  code  to  imply  that  it  will  switch  the  CN  to 
broadcast  mode  for  the  accessed  data.  The  data  is  then 
sent  into  the  CN  which  has  been  set  to  broadcast  mode  by 
the  coordinator,  and  will  go  to  all  processors. 

All of the above commands may arrive at any CN clock cycle.
Except for the clocking, there is no synchronism imposed.
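
The four commands can be summarized in a toy Python model of one EM module. This is a sketch under stated assumptions: the class and method names are hypothetical, check bits, frames, and acknowledge timing are not modelled, and FETCHEM is therefore indistinguishable from LOADEM here because the two differ only in how long "acknowledge" is held and in the broadcast setting of the CN.

```python
class EMModule:
    """Toy model of one EM module; check bits and timing are not modelled."""

    def __init__(self, number, words=65536):
        self.number = number
        self.mem = [0] * words

    def request(self, module_no, command, address, data=None):
        # The module number accompanying the strobe must match our own.
        if module_no != self.number:
            raise ValueError("module number mismatch")
        if command == "STOREM":
            self.mem[address] = data
            return None
        old = self.mem[address]
        if command == "LOCKEM":
            self.mem[address] = old | 1   # write ONE into the low-order bit
        elif command not in ("LOADEM", "FETCHEM"):
            raise ValueError("unknown command")
        return old                        # LOCKEM sends back the old copy

em = EMModule(number=17)
em.request(17, "STOREM", 40, data=0b1010)
first = em.request(17, "LOCKEM", 40)     # old value, low-order bit was ZERO
second = em.request(17, "LOCKEM", 40)    # low-order bit already set by `first`
loaded = em.request(17, "LOADEM", 40)
```

Returning the old copy while setting the bit is what lets a requestor tell whether it was the one that changed the bit from ZERO to ONE: only the first LOCKEM sees the bit clear.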
Table 5-9. Extended Memory Module (EM module) Characteristics

Number in system:  521 (No. of on-line spares: 4)

Function

Serves as main memory for array processor; serves as shared
memory among the processors.

Mode of Operation


Storage Capacities

65,536 words/module x 55 bits (48 data)
240 ns cycle
MOS dynamic RAM technology

Connectivities

To/From      Function or Name      No. Signals   Timing
CN           Data, Addresses,      24            20 ns per frame;
             Commands                            1st frame synch.
                                                 to 120 ns clock
DBM cont.    Read, Write, to DBM   36            Clocked by CN clock
via EM
fanout

Reliability/Repairability/Trustworthiness

All data is covered by SECDED. The generators and checkers
are contained in the elements that are the source and
destination of the data.

A parity checker checks parity on the module-number/address/
op-code fields received through the CN.

Physical

Projected chip count:    85 (55 memory chips)
Size:                    One 11" x 10" board
Additional constraints:  Each EM module may be self-contained
                         for power regulation, just as is the
                         processor, to simplify power
                         distribution.


5.6.3  DBM  Interface 

In addition to the above, there are two commands that result in
cycle-stealing for EM-DBM transfers. These commands and their
addresses come from the DBM controller:

(1)  Read  from  address  to  one-word  buffer,  and 

(2)  Write  to  address  from  one-word  buffer. 

The  one-word  buffers  are  loaded  from,  or  unloaded  to,  the  data  bus 
to  DBM  under  DBM  controller  control. 

A transfer rate of 20 nanoseconds per word (50 million words per
second) is achieved on this bus. Every 20 nanoseconds, the
controls associated with this bus increment the EM module number.
Decoding logic for this module number is found in the EM fanout
tree, where it is made conditional on the designation of spare EM
module. The EM address space has 512 words at each EM address to
simplify the address computations within the program. For writing,
the EM modules are cycled after 512 words are loaded into the
1-word buffers, and those EM modules whose buffers are flagged
"full" write, while the nine others do not. For reading, all 521
EM modules are caused to cycle, but only the 512 valid words at
this address-within-module will be transferred to DBM.
Incrementing of module number, for loading or unloading the 1-word
buffers, is done modulo 521. The address-within-module is
broadcast from the DBM controller, and is incremented every 512
words transferred.
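
The sequencing described above can be sketched as a short Python generator. This is an illustration only, assuming the transfer starts at word address zero; the names are hypothetical and spare-module substitution is not modelled.

```python
N_MODULES = 521
ROW = 512
WORD_TIME_NS = 20

def dbm_write_sequence(n_words):
    """Yield (module number, address within module) for each word moved."""
    for k in range(n_words):
        # Module number steps modulo 521 every word; the address within
        # the module steps once per 512 words transferred.
        yield k % N_MODULES, k // ROW

seq = list(dbm_write_sequence(2 * ROW))
row0 = [module for module, address in seq if address == 0]
idle_in_row0 = N_MODULES - len(set(row0))
words_per_second = 1e9 / WORD_TIME_NS
```

For the first 512 words, 512 distinct modules have their one-word buffers filled and the remaining nine are left idle, matching the "nine others" of the text; the stated 20 ns per word corresponds to 50 million words per second.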

5.6.4  EM  Fanout 

A second  fanout  tree,  similar  to  that  between  the  coordinator  and 
the  processors,  comes  from  the  DBM  controller  and  carries  requests 
for  EM  cycles  from  that  controller. 

It  also  carries  EM  addresses,  and  the  two  clock  lines  to  the  EM. 
Because  of  the  requirement  for  addresses,  this  one  has 
substantially  more  parts. 

From the DBM controller come address, command, clocks, and timing
for loading or unloading the one-word buffers in the EM module.
From the EM modules comes an "error" signal. Spares designation
is done by controlling processor access, not by switching EM
modules in and out, so no spares designation signals are in this
tree. Figure 5.10 shows the EM Fanout Tree. Table 5.10 summarizes
the characteristics of this Fanout Tree.


5.6.5  Design  Rationale 

Size of the EM module is in direct response to Ames' statements
about the size of the data base of the aero flow codes they expect
to run on the NASF. Speed of the EM module is derived from
observations about the number of EM accesses necessary to support
a given quantity of floating point operations in the processor.
The range of floating point operations per EM access was observed
to typically lie between 5 and 20 for the aero flow codes. The
resulting EM access times were seen not to impact the running time
of the entire aero flow codes, although some minor sections of
those codes were noticeably slowed by accessing EM at the
currently designed speeds.

It should be noted here that advances in semiconductor memory
technology may make it feasible to consider use of 256-kilobit
chips instead of the current 64-kilobit chips. Also in the
future, 64-kilobit chips can be expected to be reasonably faster
than the current chips. Therefore, depending on when final design
decisions are made, a tradeoff could be made between the following
options:

(1)  256K words/module x 521 modules (large storage), or

(2)  64K words/module x 521 modules (current size but
faster).

The  considerations  will  be  that  option  (1)  would  have  much  larger 
on-line  storage  with  no  impact  on  performance  projections.  Option 
(2)  assumes  existing  plans  for  data  storage  requirements,  but  the 
faster  parts  would  result  in  a faster  system  and  increased 
throughput  (note  that  here  one  could  consider  fewer  processors  and 
lower  cost  to  get  the  requested  throughput). 
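
The storage totals implied by the two options follow from simple arithmetic, sketched below in Python. The split between physical words and program-visible words assumes the 512-words-per-EM-address convention of section 5.6.3; the function name is hypothetical.

```python
ON_LINE_MODULES = 521
WORDS_PER_ROW = 512          # words visible to the program at each EM address

def em_words(words_per_module):
    """Return (physical words on line, words in the 512-wide address space)."""
    return (words_per_module * ON_LINE_MODULES,
            words_per_module * WORDS_PER_ROW)

option_1 = em_words(256 * 1024)   # 256K words/module
option_2 = em_words(64 * 1024)    # 64K words/module
```

Option (2) gives about 34.1 million physical words, of which about 33.6 million are addressable by the program; option (1) quadruples both figures.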

5.7  CONNECTION  NETWORK  (PROCESSORS  TO  EXTENDED  MEMORY) 

A flexible means of communication between the processors and the
Extended Memory modules is required. In order to achieve a
reasonable compromise between performance and hardware cost, the
connection network is based on the "Omega" network (Ref. 6) rather
than on the crossbar switch. The resulting network provides a
path from each processor to the EM module selected by that
processor. The network does not have a central, global control.
Figure 5.10  EM Fanout Tree Block Diagram


Table 5-10. EM Fanout Characteristics


Number in System:  1

Function

Distribute addresses and commands from DBM controller to EM
modules. Distribute clock from DBM controller to EM modules.

Mode of Operation

Passive logic, no flip-flops, no execution of commands.

Storage Capacities

None

Connectivities

To/From     Function or Name     No. Signals   Timing   Comments
DBM cont.   Addresses, control   34                     22 bits of ad<
Coord.      Clocks               2
EM mod.     All above            36                     per module

Reliability/Repairability/Trustworthiness

Low parts count makes additional reliability precautions
unnecessary in comparison to the reliability of the rest of the
FMP.

Physical

Projected chip count:  1250 (of which 116 are hex buffers of one
                       sort or another)
Size:                  36 boards


The requirements put on the Connection Network are that it have
immediate response to connectivity requests (tens of
nanoseconds); that it have on the order of N log N parts, as does
the Omega or the Benes network, instead of the N^2 parts of the
crossbar switch; that, like a crossbar, it be able to provide all
N paths simultaneously when the requests for connection are a
p-ordered vector; and that it be able to handle almost all N paths
at once, with only modest delay imposed on a few of the requests,
when the requests do not form a p-ordered vector. All of these can
be accommodated in a design based on the connectivity of the Omega
network as shown in Fig. 5.11.

The network has been designed with the added capability of
processor-to-processor connection and provides transfer paths to
and from the coordinator. Although the path connectivity of the
network cannot be externally controlled, special communications
modes (such as "broadcast") are available under control of the
Coordinator.

The following discussion requires the use of certain definitions,
as follows:

A "p-ordered vector" is a set of requests in which the EM
module number being accessed by processor N is equal to (d +
pN) modulo 521, where d is called the "offset", and p is the
"skip distance". When p is also the distance between
successive addresses, p has also been called the "stride".
"Stride" modulo 521 equals "skip distance."

A "p-q-ordered vector" is defined in Appendix B as a set of
requests from processors 0 through 511 such that processor
number i is requesting from memory module M_i given by

    M_i = (a + p*i + q*((i-b) DIV k)) modulo 521.

In this equation, k is the length of each piece of vector, p is
the skip distance within each piece, and q is the additional skip
distance between pieces. The constant a is the offset. The
constant b is the amount by which the first piece is short, since
the first piece might be a leftover from some previous fetching
of a p-q-ordered vector. A simple example is shown in
Appendix B.
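
The two definitions can be exercised with a small Python sketch. It is illustrative only: the function names are hypothetical, and DIV is assumed to be floor division (Python's //), an assumption that matters only for processors with i < b.

```python
N_PROCESSORS = 512
N_MODULES = 521

def p_ordered(d, p):
    """EM module requested by each processor N: (d + p*N) modulo 521."""
    return [(d + p * n) % N_MODULES for n in range(N_PROCESSORS)]

def pq_ordered(a, p, q, k, b=0):
    """EM module requested by processor i in a p-q-ordered vector."""
    # Assumption: DIV is floor division; this only matters when i < b.
    return [(a + p * i + q * ((i - b) // k)) % N_MODULES
            for i in range(N_PROCESSORS)]

def conflict_free(modules):
    """True when no two processors request the same EM module."""
    return len(set(modules)) == len(modules)

unit_stride = conflict_free(p_ordered(d=5, p=1))
multiple_of_521 = conflict_free(p_ordered(d=5, p=521))
same_when_q_zero = pq_ordered(a=5, p=3, q=0, k=64) == p_ordered(d=5, p=3)
```

With q = 0 the p-q-ordered vector reduces to a p-ordered vector with offset a, and every skip distance from 1 through 520 addresses 512 distinct EM modules because 521 is prime.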

5.7.1  Functional Description

The  Connection  Network  (CN)  has  two  modes  of  control.  First,  in 
the  normal  mode,  the  CN  establishes  connections  to  the  Extended 
Memory  under  control  of  the  Processors.  Second,  the  Coordinator 
may  use  the  Network  for  a number  of  special  purposes  as  described 
below. 


In the normal mode, a "request" establishes a two-way connection
between requesting processor and the requested EM module. The
establishment of the connection is acknowledged by the EM module.
The "acknowledge" is transmitted to the requestor. The release of
the connection is initiated by timing internal to the EM module.
Only one request at a time arrives at a given EM module. The CN,
not the EM module, resolves conflicting requests.

The following states of the connection network are established on
command from the coordinator.

(1)  "Broadcast from coordinator".  One word of data is
distributed from the coordinator to all processors.

(2)  "Harvest to Coordinator".  One word of data,
representing the AND or OR or some mixture thereof, of
the words presented by each of the enabled processors,
is received at the Coordinator.  Expected to be used by
diagnostics with just one processor enabled.

(3)  "Broadcast from EM".  The EM module previously
identified by a request from the Coordinator will have
the data being emitted by it broadcast to all
processors.

(4)  "Wraparound at stage n".  Each pair of processors whose
numbers differ by the bit at the nth bit position shall
be connected, and data shall be swapped between them
using the bidirectional path established.  Processors
whose port numbers are separated by 2^n swap data.

(5)  Diagnostic control.

(6)  "Null".  Respond to processor requests normally.
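
The pairing used by "Wraparound at stage n" can be sketched in Python as follows. This is a minimal illustration, assuming bit positions are counted from zero at the least significant bit; the function names are hypothetical.

```python
def wraparound_partner(port, n):
    """Port paired with `port` when wrapping around at stage n (bit n flipped)."""
    return port ^ (1 << n)

def wraparound_swap(values, n):
    """Return the per-port data after every pair has swapped its word."""
    return [values[wraparound_partner(port, n)] for port in range(len(values))]

data = list(range(8))
swapped = wraparound_swap(data, 1)     # partners are separated by 2**1
restored = wraparound_swap(swapped, 1)
```

Flipping a single bit is its own inverse, so applying the same wraparound twice returns every word to its original port.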

The connection network appears to be a dial-up network with up to
512 callers, the processors, possibly dialing at once. There are
512 processor ports, 521 EM module ports, and two coordinator
ports, one of which "looks like" a processor port, and the other
like an EM port. Processor ports, and the coordinator port, are
capable of accepting "requests".

The time required to set up each path is commensurate with the
access time of EM, which in turn is designed to be suitable for
the number of EM accesses observed in the applications studied.
In the CN design described in this section, the minimum time to
set up a connection is 120 ns. This time is achieved for most
cases, including specific cases that are important in the aero
flow and weather code applications studied.
5.7.2  CN Complexity Considerations


The basic Omega network provides only one possible path from a
given processor-side port to an EM-side port. A network of this
sort may experience blockage, especially during periods of heavy
simultaneous usage by all processors. A number of methods were
considered to reduce the probability of blockage and to increase
the effective throughput through the network. Three of these
methods will be summarized below.

The "natural" size (in terms of numbers of ports on each side) is
a power of 2. Since there are 521 + spares + Coordinator
connections on the EM-side, the network can be considered to be a
1024 x 1024 network. This additional size is the first method of
reducing blockage. Half of the processor-side ports are unused
and slightly less than half of the EM-side ports are unused.
Thus, there is immediately a factor of two reduction in the
maximum number of requests for service to the network. By
spreading the active elements across all available ports,
potential blockage is further reduced by reducing the total number
of nodes in the network where blockage is physically possible, as
explained in section 5.7.3 below.

The second method, a simple duplexed network, requires
approximately twice as many parts as the network just
described. In this case, the network is duplexed (i.e., there are
two copies) in order to provide alternate paths. Then requests
that may be blocked on one Omega network may find a path on the
second (which carries only those requests blocked on the first
"layer").

The duplexed network contains exactly twice as many 2 x 2 switch
nodes and twice as many node-to-node connections (one set on each
layer). In addition, a small amount of extra routing logic is
needed on the processor-side and a small amount of arbiter logic
is needed on the EM-side of the network.

A third method, a duplexed network with interlayer paths, has even
less blockage. In this method the total number of connections in
the network is the same as in the second alternative just discussed.
The corresponding pair of 2 x 2 switch nodes (in the two Omega
networks or layers) is replaced by one 4 x 4 switch node.
Connectivity is provided between layers at each node, thus greatly
increasing the total number of possible paths from a
processor-side input to an EM-side output. The resulting network
appears the same as the Omega network (Fig. 5.11) but each
connection drawn actually is two independent connections and each
node is a 4 x 4 switch rather than a 2 x 2 switch.
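
The single-path property of the basic Omega network, and the blockage that follows from it, can be illustrated with a small Python sketch. This is a generic single-layer Omega network with destination-tag routing, not the FMP node logic; the first-come arbitration, the 8-port size, and all names are assumptions made for the illustration.

```python
def omega_path(src, dst, n_bits):
    """Output links used, stage by stage, by destination-tag routing."""
    mask = (1 << n_bits) - 1
    pos, links = src, []
    for stage in range(n_bits):
        bit = (dst >> (n_bits - 1 - stage)) & 1   # destination bit, MSB first
        pos = ((pos << 1) & mask) | bit           # shuffle, then pick output
        links.append((stage, pos))
    return links

def count_blocked(requests, n_bits):
    """Number of (src, dst) requests that find one of their links in use."""
    busy, blocked = set(), 0
    for src, dst in requests:
        links = omega_path(src, dst, n_bits)
        if any(link in busy for link in links):
            blocked += 1
        else:
            busy.update(links)
    return blocked

def bit_reverse(x, n_bits):
    """Reverse the n_bits-wide binary representation of x."""
    return int(format(x, f"0{n_bits}b")[::-1], 2)

identity = [(i, i) for i in range(8)]
reversal = [(i, bit_reverse(i, 3)) for i in range(8)]
blocked_identity = count_blocked(identity, 3)
blocked_reversal = count_blocked(reversal, 3)
```

The identity pattern passes without conflict, while the bit-reversal pattern is blocked at the first stage even though no two requests share a destination. A second layer gives such blocked requests another set of links to try.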
A threefold investigation has gone into the optimization of the
CN. First, a functional simulator was written, in which a variety
of test cases could be generated, and the resulting sets of
requests submitted to the simulated CN to observe the behaviour.
The processors in this simulation had a queue of up to five
requests each. The number of processors making a request could be
varied. There was provision to test 48 different CN design
options.

Second, a statistical evaluator was written, in which the
percentage of conflicts for random permutations on the inputs
could be computed for a variety of different CN design options.
For the CN option that they both handle, namely the single-layer
Omega network, the evaluator and the simulator give identical
results.

Third, an analytical evaluation of the CN behaviour, for
particular CN design options, was carried out. Each of these is
discussed in more detail in Appendix B and Appendix H.

Either the simply duplexed network or the duplexed network with
interlayer paths would be acceptable. The latter has the least
blockage, but a somewhat higher parts count. In the evaluations
made, both the simple duplexed network and the duplexed network
with interlayer paths had 100 percent success in fetching vectors
in two of three directions. The simple duplexed network had a
success rate of 1^ percent in the third, or "hard", direction while
the other, more complex network had a success rate of 87 percent
in this case. (Success rate is defined to be the percentage of
requests which connect immediately to EM-side outputs with no
blockage. The experiments concerned had all processors active.)
In either design, if vectors with preferred skip distances are
presented to the network, 100 percent of the requests are
satisfied immediately. A skip distance of one is always satisfied
100 percent. Table 5.11 is based on the simple duplexed network.

5.7.3  Processor  and  EM  Connection  Mapping 

The Connection Network has 1024 ports on the processor side,
numbered from 0 through 1023, and likewise on the EM module side.
Because potential blockage in the network is a function of
destination address and the origin of requests, the allocation of
processors and EM modules to ports of the network becomes an
important concern. The allocation function is called a mapping
function. The mapping function serves to map processor number
onto input port number and EM module number onto output port
number.


Table 5-11. Connection Network (CN) Characteristics


Number  in  System:  1 

Function 

To serve as a dial-up network whereby each processor can
access any EM module in a time comparable to the access time
of the EM module. To serve also as a broadcast network wherein
the coordinator or any EM module can broadcast to all
processors. To serve as the converse of broadcasting in
which the coordinator can harvest a single word from all
processors. To furnish some minimal processor-to-processor
communication.

Mode  of  Operation 

Individual 2x2 switches are combined into a locally controlled
network. Control of the individual 2x2 node is generated
within itself from the signals presented to it, without
regard to the state of the rest of the network. There are no
latches or flip-flops within the CN; it is entirely
combinatorial logic.

Storage  Capacities 

words 
ns  cycle 
technology 


To/From         Function or Name            Signals   Timing                 Comments

Proc/coord      Data path, processor side   24        20 ns/frame,           513 such
                                                      120 ns major timing    connections

EM mod./coord   Data path, EM side          24        same                   522 such
                                                                             connections

coord           Control                     2

Reliability/Repairability/Trustworthiness

All  data  passing  through  the  CN  is  covered  by  SECDED, 
resulting  in  the  correction  of  single  transient  errors,  and 
the  detection  of  all  hard  errors. 

The  internal  redundancy  of  paths  will  provide  that  function 
continues  for  some,  but  not  all,  of  the  failure  modes  of  the 
CN. 
Table 5-11. Connection Network (CN) Characteristics (Cont'd)


Physical

Projected chip count: 39280
Size:
Power:


Port = 32 x (EMno MOD 512) + 2 x (EMno MOD 4) + 1 for 512 <= EMno <= 527

Within  each  cabinet,  for  the  256  ports  in  that  cabinet,  EM  modules 
are  attached  to  all  even  ports  0,  2,  4,  etc.,  through  254,  and  to 
odd ports 1, 35, 39, and 103. In four cabinets, there are 512 +
16 ports thus addressable, allowing up to seven spares. Any spare
can be used in place of any failed EM module, up to four in total,
a limit set by the remapping in the CN buffer.

Furthermore,  the  remapping  described  above  is  done  with  simple 
wired-in shifting and ORing. The substitution of spare for bad
EM module is done by substituting one EM module number (521, 522,
523, or 524) for the EM module number of the failed module. The
conversion from EM module number to port number is fixed, mostly
just by wiring, in the CN buffer, as shown in Figure 5.12.


5.7.4  Hardware  Aspects 

5.7.4.1 Clocks and Synchronization

Requests are made in synchronism with the "CN clock". The CN
clock  is  a submultiple  of  the  processor  clock.  The  CN  clock  will 
be  synchronous  and  simultaneous  across  all  requesting  ports  (512 
processors  plus  coordinator).  The  acknowledge  from  EM  module  is 
received  within  a single  CN  clock  period,  since  the  CN  clock 
period  is  greater  than  the  roundtrip  delay  through  the  network. 
Since  EM  can  be  accessed  only  in  synchronism  with  the  CN  clock, 
the  EM  cycle  time  will  be  a multiple  of  the  CN  clock. 

Processor  clock  is  distributed  in  synchronism  to  all  processors. 
A signal  which  selects  every  mth  processor  clock  pulse  as  the  CN 
clock  is  also  distributed  from  the  clock  source,  but  the  timing 
reference  is  carried  on  the  processor  clock  itself. 

The values computed from projected characteristics are 40 ns for
the processor clock period, 120 ns for the CN clock period, and
five CN clocks, or 600 ns, from the beginning of one request for
read access (LOADEM) to the beginning of the next request for read
access from the same processor, given that there are no blockages
in the CN itself. For store access to one EM module, the CN
buffer must wait 360 ns before accessing any other EM module, for
either read or store.
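The relationships among these projected figures can be checked with a few lines of arithmetic. The sketch below only restates the numbers quoted above; the variable names are illustrative and are not taken from the design.

```python
# Projected clock figures quoted in the text (nanoseconds).
PROCESSOR_CLOCK_NS = 40
CN_CLOCK_NS = 120
READ_SPACING_CN_CLOCKS = 5   # LOADEM to next LOADEM from the same processor
STORE_HOLD_NS = 360          # wait after a store before the next EM access

# "Every mth processor clock pulse" is selected as the CN clock.
m = CN_CLOCK_NS // PROCESSOR_CLOCK_NS

# Spacing between successive read requests, with no blockage in the CN.
read_spacing_ns = READ_SPACING_CN_CLOCKS * CN_CLOCK_NS

# The store hold time expressed in CN clock periods.
store_hold_cn_clocks = STORE_HOLD_NS // CN_CLOCK_NS

print(m, read_spacing_ns, store_hold_cn_clocks)
```

With these values m is 3, so the CN clock is every third processor clock pulse, and the 360 ns store hold is three CN clock periods.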

5.7.4.2 Switch Element

Figure  5.13  shows  the  logic  in  one  switch  element.  The  control 
logic  occurs  once  and  the  sets  of  AND-OR  gates  are  each  repeated 
twelve  times  as  indicated  on  the  diagram. 


Figure 5.12 Mapping of EM Module Number to CN Output Port Number


For  the  processors,  several  mappings  have  been  tried  or  proposed: 

1. Processors 0 through 511 attached to ports 0 through 511.

2. Same, except processor number is bit-for-bit the reversal
of port number. That is, processor number 110000000 is
attached to port number 000000011; processor 1 is
attached to port number 256; and so on.

3. Processors 0 through 511 connected to even numbered ports
0 through 1022.

4. Same as 3, except for the bit-for-bit reversal. That is,
processor 110000000 is attached to port number
0000000110; processor 1 is attached to port number 512,
and so on.

5.  An  assignment  of  processors  to  ports  such  that  the 
connectivities  of  the  omega  network  will  make  connection 
cyclically  among  the  processors,  processor  N being  able 
to  transmit  to  processor  N+1. 

6.  A random  assignment  of  processors  to  ports. 

Similar  assignments  can  be  made  on  the  EM  module  side,  except  that 
the  EM  modules  from  number  512  to  number  520  must  be  allocated 
also. 
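Mappings 2, 3, and 4 can be written down directly from the examples given in the list. The following is a minimal sketch, assuming 9-bit processor numbers and 10-bit port numbers as in the examples; the function names are illustrative only.

```python
def reverse_bits(value: int, width: int) -> int:
    """Return `value` with its low `width` bits in reversed order."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def mapping2_port(proc: int) -> int:
    # Mapping 2: the 9-bit port number is the bit reversal of the
    # 9-bit processor number.
    return reverse_bits(proc, 9)


def mapping3_port(proc: int) -> int:
    # Mapping 3: processors occupy the even ports 0, 2, ..., 1022.
    return 2 * proc


def mapping4_port(proc: int) -> int:
    # Mapping 4: reverse the processor number as a 10-bit field.
    # Processor numbers are below 512, so the result is always even.
    return reverse_bits(proc, 10)


print(mapping2_port(0b110000000), mapping2_port(1))
print(mapping4_port(0b110000000), mapping4_port(1))
```

The two examples in the list serve as checks: processor 110000000 goes to port 000000011 under mapping 2 and to port 0000000110 under mapping 4, and processor 1 goes to ports 256 and 512 respectively.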

Mappings 1 and 2 can be eliminated by the observation that all the
processors, or EM modules, are crowded up into one part of the
network, creating additional conflicts. This expectation is
validated by the results of the CN simulator using these mappings.
Mapping number 6 can be eliminated by the argument that other
mappings give much better results for the frequently used
p-ordered requests and p-q-ordered requests than they do for
random requests. The best operation seen with the simulator
suggests that mappings 3 and 4 should be used, one on either edge
of the network. The best case actually simulated was processors
using mapping 4 and EM modules using mapping 3 on the simple
duplexed network. Call this the "baseline" mapping function.

With  the  above  choice  of  mapping  functions,  the  known  frequently 
used  requests  are  serviced  with  100  percent  or  near-100  percent 
request  success,  and  random  cases  are  serviced.  The  simple 
duplexed  network  shows  an  average  of  77  percent  nonblocking  in  the 
network  for  random  requests.  The  duplexed  network  with  interlayer 
paths  shows  87  percent  nonblocking,  and  also  represents  the  rate 
of  request  success  seen  on  random  p-ordered  vectors.  Success  rate 
is  100  percent  on  requests  with  skip  distance  = 1.  Success  rate  is 
near  100  percent  on  p-q-ordered  vectors  with  skip  distance  = 1 
within  the  pieces  of  vector. 
For any mapping, there is a bad case, a permutation in which only
32 out of the 512 accesses requested are granted per EM cycle. It
is desirable that this case be one that is not expected to occur
(note that the bad case is not a catastrophe; it is merely an
excess access time for one memory fetch). For mappings 3 and 4,
the bad case is when the EM accesses desired are the bit-for-bit
reversal of a sequential index. This case actually occurs once in
one of the several ways to program a fast Fourier transform.
Hence, investigation of mappings is expected to continue,
including mapping No. 5, which moves the bad case to some more
random permutation, and allows an interesting data exchange
pattern for the SHIFCN instruction. However, the fast Fourier
transform, with one transform executed in parallel across the
array, does not occur in any aero flow code or weather code
evaluated. The FFTs in one weather code are executed serially,
512 FFTs in 512 processors, and do not contain the bit-for-bit
reversed subscripted parallel fetch request.
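The sensitivity of an omega-type network to the request pattern can be illustrated with a toy model. The sketch below is not the CN simulator and does not model the duplexed network or the baseline mapping: it assumes a plain single-layer omega network with every port requesting, destination-tag routing, a "first request wins" rule at each node output, and blocked requests simply dropped rather than retried. All names are hypothetical.

```python
def reverse_bits(value, width):
    """Return `value` with its low `width` bits in reversed order."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def omega_unblocked(requests, n_bits):
    """Return the (source, destination) requests that pass every stage.

    After stage i, a request occupies the line whose address is the low
    (n_bits - i) bits of the source followed by the top i bits of the
    destination. Two requests on the same line conflict; the first one
    seen keeps the line (an arbitrary tie-break chosen for this sketch).
    """
    mask = (1 << n_bits) - 1
    alive = list(requests)
    for stage in range(1, n_bits + 1):
        claimed = {}
        for src, dst in alive:
            line = ((src << stage) | (dst >> (n_bits - stage))) & mask
            claimed.setdefault(line, (src, dst))
        alive = list(claimed.values())
    return alive


def count_unblocked(pattern, n_bits):
    """Count unblocked requests when every port s requests pattern(s)."""
    n_ports = 1 << n_bits
    requests = [(s, pattern(s)) for s in range(n_ports)]
    return len(omega_unblocked(requests, n_bits))


N_BITS = 10  # 1024 ports
unit_shift = count_unblocked(lambda s: (s + 1) % (1 << N_BITS), N_BITS)
bit_reversed = count_unblocked(lambda s: reverse_bits(s, N_BITS), N_BITS)
print(unit_shift, bit_reversed)
```

In this toy model a unit shift passes all 1024 requests, while the bit-reversal pattern leaves only 32 unblocked in the first pass. The figures belong to the toy model and are not a reproduction of the simulator results quoted in the text.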

It  might  be  noted  here  that  requests  within  the  Connection  Network 
refer to a CN port, not to a processor or EM module number.
Therefore all mapping must be done external to the CN. Mapping of a
processor number to a port has implications only for the wiring
pattern that is used to let each processor know its own number.
Off-line spare processors are inhibited from making requests to
anything other than spare EM modules. This is done in the CN
buffer logic of the processor. In addition, the CN buffer logic
is responsible for mapping EM module number to CN output port.
This implies that the provision for spare EM modules must be
accommodated in the remapping from EM module number to CN port
number, since the ports will not be physically relocated when an EM
module  is  spared.  In  every  CN  buffer,  four  port  numbers  will  be 
caught  and  replaced  by  substitute  port  numbers. 

The  suggested  mapping  from  module  number  to  port  number  is  as 
follows:

First, put the most significant bit of EM module number at the
least significant end of port number. This gives

Port = 2 x EMno for 0 <= EMno <= 511

and would give

Port = 2 x (EMno MOD 512) + 1 for 512 <= EMno <= 520
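The two formulas above amount to a one-bit left rotation of a 10-bit EM module number. The sketch below computes the resulting port numbers and shows where modules 512 through 520 would land. The assumption that each cabinet holds a contiguous block of 256 ports, with cabinet 0 holding ports 0 through 255, is made here for illustration; the names are hypothetical.

```python
PORT_BITS = 10
PORTS_PER_CABINET = 256  # assumed: contiguous quarter of the 1024 ports


def em_to_port_unmodified(em_no: int) -> int:
    """Rotate the 10-bit EM module number left by one bit.

    The most significant bit moves to the least significant end, giving
    2 x EMno for modules 0..511 and 2 x (EMno MOD 512) + 1 for 512..520.
    """
    mask = (1 << PORT_BITS) - 1
    msb = (em_no >> (PORT_BITS - 1)) & 1
    return ((em_no << 1) & mask) | msb


def cabinet_of(port: int) -> int:
    # Assumed cabinet numbering: cabinet 0 holds ports 0..255, and so on.
    return port // PORTS_PER_CABINET


high_order_ports = [em_to_port_unmodified(em) for em in range(512, 521)]
high_order_cabinets = {cabinet_of(p) for p in high_order_ports}
print(high_order_ports, high_order_cabinets)
```

Under these assumptions the nine high-order modules occupy odd ports 1 through 17, all of which fall in the same cabinet.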

This  last  formula  is  unacceptable  as  it  puts  all  nine  high-order 
EM  modules  into  the  first  cabinet.  Port  numbers  are  rigidly 
assigned to cabinets, one quarter to each cabinet. The second
formula  may  be  modified  as  follows: 
The simple duplexed network would be packaged as follows:

The 512-wide, 10-deep by 2 layer arrangement of nodes can be
partitioned into 2-wide, 2-deep by 1 layer subsets in which every
subset is like every other subset. A 1-bit-wide slice of this
subset will fit on a 24-pin package as a single chip of moderate
complexity; 24 x 256 x 5 x 2 such chips will implement the entire
CN. This choice yields a total of 57,440 packages, all identical,
all in 24-pin packages. One observes that the use of the data
lines is half duplex, not full duplex. If bidirectional data
lines were used, a more complex chip, handling both directions of
data on the same line, would still have the same pin count. Strobe
and acknowledge, however, could not be combined. The result would
be 13 packages per node, instead of 25, and the total chip count
of 13 x 256 x 5 x 2 would be 39,280.

In  a 40-pin  package,  the  subset  two  nodes  wide  by  two  levels,  and 
both  layers  deep  could  be  accommodated,  so  that  exactly  half  as 
many  40-pin  packages  would  be  used,  or  28,720  packages  without, 
and  19,640  packages  with,  bidirectional  data  lines.  In  any  of  the 
four  cases,  the  control  logic  is  replicated  on  each  chip  to  reduce 
pin count. The next-to-largest of these various projections is
used in Table 5.11 (which shows the CN Characteristics) to be
conservative without complete pessimism.

A complete  new  chip  design  is  not  planned.  Rather  a gate  array 
implementation  is  likely. 

5.7.4.3 Packaging

Most of the CN is packaged within the EM cabinets, an identical
subset of the CN being found in each of the four cabinets. Note
that in Figure 5.11 the Omega network to the right of the second
level of logic is exhibited as four identical Omega networks of
one quarter the width. Thus, the 80 percent of the CN past the
first two levels of logic is found in the EM cabinets.

(If the processor cabinets had enough room, and if processor
numbers were assigned to cabinets in the correct pattern, the same
partitioning of 80 percent of the CN to processor cabinets could
also be achieved. An interesting puzzle is to devise those
assignments of processor number to cabinet that allow all of the
CN to be distributed among the processor and EM cabinets, with
none of the CN assembled in any one central location, such as
colocated with the coordinator and diagnostic controller.)

5.7.5  Design  Rationale  and  Changes  from  Preliminary  Study 

The  CN  seen  here  represents  a major  change,  and  a major 
improvement,  over  the  transposition  network  described  in  Ref.  1 
and Ref. 2. The transposition network was at its most efficient
only for 512-long vectors. For p-q-ordered vectors, the access
time went up proportional to the number of pieces into which the
vector had to be divided (five pieces for a 100 x 100 x 100
problem in the third, or hard direction). Conditional
statements within DOALLs resulted in complex code in those processors
that were trying not to execute anything; they had to pretend to
be fetching and storing to EM like other processors in order to
keep the synchronizations straight. Analysis in the compiler was
therefore also complex.

With  this  connection  network  all  these  complexities  disappear. 
Each  processor  is  completely  independent  of  any  other  processor. 
The language has been simplified, since restrictions on
conditional statements and labels have been removed. The compiler
has been simplified, since the conditional LOADEM and STOREM
operations are no longer necessary, and a lot of address
calculation that took place at compile time, or which had to be
allowed for in the old control unit, is not needed with the
present connection network.

The CN chip count represents another cost/performance
optimization. For performance, a 516 x 528 crossbar switch, with
no conflicts, and all accesses being granted on the first attempt
at request, would be preferred. However, the crossbar switch has
275,088 crosspoints, whereas the CN has 40,960 crosspoints (four
in each 2x2 node). This is just 15.2 percent as many
crosspoints, reflecting a large ratio in hardware also. Despite
this huge hardware saving, the CN has 100 percent success in
fetching vectors in two of the three directions, and a success
rate of 77 percent (or 87 percent if the alternate design is
taken) in the third, or "hard", direction.

A second optimization of speed vs. hardware cost occurs in the
path width of the CN. At 11 bits per frame, we need a path that
is 12 signals wide, and takes five frame times to transfer a whole
word. At 20 ns per frame this means that the delay due to
serialization of the data word is an additional 80 ns, and
dividing address into two characters adds 20 ns. The delay due to
access time in EM is on the order of 200 ns (actually, it is yet
to be determined, and the recent TI announcement of a 64k-bit RAM
makes it appear that EM will be faster than projected in Ref. 1).
The delay due to round trip transfer time through the cables and
logic of the CN is estimated at 120 ns. Thus the 100 ns added
delay due to serialization of address and data is small compared
to the 320 ns or so minimum possible access time. In Reference 1,
a path width of 8 bits was chosen as adequate. This has been
increased to 11 bits in order to present the request in fully
parallel fashion, the request being the port number on the EM side
of the CN.
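The serialization figures can be reproduced as a short calculation. In this sketch the 55-bit word width is an assumption taken from the DBM description (eight 55-bit words emitted in parallel); the remaining numbers are those quoted in the paragraph, and the names are illustrative.

```python
FRAME_NS = 20
BITS_PER_FRAME = 11
WORD_BITS = 55            # assumed word width, per the DBM description
ADDRESS_CHARACTERS = 2    # address is divided into two characters
EM_ACCESS_NS = 200        # "on the order of", still to be determined
ROUND_TRIP_NS = 120       # estimated cable and logic delay

# Frames needed to carry one word, rounded up to a whole frame.
frames_per_word = -(-WORD_BITS // BITS_PER_FRAME)

# Only the frames beyond the first add delay relative to a fully
# parallel transfer.
data_extra_ns = (frames_per_word - 1) * FRAME_NS
address_extra_ns = (ADDRESS_CHARACTERS - 1) * FRAME_NS
added_ns = data_extra_ns + address_extra_ns

minimum_access_ns = EM_ACCESS_NS + ROUND_TRIP_NS
print(frames_per_word, added_ns, minimum_access_ns)
```

This gives five frames per word and 100 ns of added delay against a 320 ns floor.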

A third optimization concerns the time it takes to compute the
control of the CN. With an unlimited amount of time for computing
the setting, the Benes network can produce a set of paths such
that all processors have their requests granted immediately, 100
percent of the time. The Benes has fewer components than the CN.
Unfortunately, we are trying to make connections in nanoseconds.
Opferman and Tsao-Wu, Ref. 7, show that the amount of computation
required to find a non-blocking setting for a Benes network is on
the order of computational steps, or N log N if an associative
memory is available. This is certainly intolerable to compute at
run time when the data is being fetched, and in our opinion is
intolerable at compile time also. Furthermore, the computations
impose synchronization onto the processors, since one new request,
asynchronously added to the existing set of latched-up requests,
requires a whole new control computation. Hence, we have opted to
search for suboptimum, but fast, control determinations, having
each node make its own determination of its own setting on the
basis of locally available information only, and ignoring the rest
of the CN.
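What "locally available information only" means for a single 2x2 node can be sketched as a small function. This is an illustration under stated assumptions, not the switch element logic of the design: each input either is idle or asks for one of the two outputs, and on a conflict input 0 is given priority (an arbitrary choice made here). The names are hypothetical.

```python
from typing import List, Optional, Tuple


def node_setting(
    want0: Optional[int], want1: Optional[int]
) -> Tuple[Optional[int], Optional[int], List[int]]:
    """Decide one 2x2 node's setting from its own two inputs.

    want0 and want1 give the output (0 or 1) requested by input 0 and
    input 1, or None when that input is idle. Returns the input feeding
    output 0, the input feeding output 1, and the list of blocked inputs.
    """
    outputs: List[Optional[int]] = [None, None]
    blocked: List[int] = []
    for inp, want in ((0, want0), (1, want1)):
        if want is None:
            continue
        if outputs[want] is None:
            outputs[want] = inp      # output is free: connect
        else:
            blocked.append(inp)      # output already taken: block
    return outputs[0], outputs[1], blocked


print(node_setting(0, 1))  # straight through
print(node_setting(1, 0))  # crossed
print(node_setting(1, 1))  # conflict on output 1
```

No state from any other node enters the decision, which is what allows every node to settle without a global control computation.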

5.8 DATA BASE MEMORY (DBM)

Data Base Memory (DBM) is the window in the computational envelope
of the FMP. All jobs to be run on the FMP are staged into DBM
before running, both program and data; all output from the FMP is
staged through the DBM. DBM can be used by the programmer to back
up EM for those problems whose data base is larger than EM.
Control of the data base memory is from a DBM controller
(described in the next section), which accepts commands both from
the coordinator for transfers between DBM and EM, and from the
The general organization of the DBM is a controller together with
a general CCD chip array, used as the primary storage area, a
number of block-sized buffers, used for speed matching on data
transfer interfaces, and error controls.

The design described here is based on the assumption that 256k-bit
CCD  chips  will  be  arranged  in  the  form  of  128  shift  registers  of 
2,048  bits  each.  It  is  also  assumed  that  the  shift  rate  of  the 
devices  will  be  2.5  MHz. 
Figure 5.14 Data Base Memory Block Diagram
DBM files come from and are moved to the SPS file management
system. Over 99 percent of this traffic is expected to be simple
moves from DBM to disk pack. Twenty Mbits/sec on this path
yields large safety factors over the traffic actually required,
even after making allowance for the fact that short jobs will be
bunched in prime time. The four channels provide 20 Mbits/sec
with 5 MHz disk transfer rates and 40 Mbits/sec with the 10 MHz
rates available in recently announced products.

No buffering is needed on the EM side beyond the one-word buffers
in each EM module. These one-word buffers, and the 240 ns cycle
time of the EM modules, together ensure that the DBM controller
never need wait for an EM response.

DBM-EM transfers have priority over CN servicing in the EM
controls. However, there is little interference with processor
accesses to EM. For example, when transferring from EM to DBM,
one EM cycle loads 512 of the per-EM-module one-word buffers, and
then waits for 12.8 microseconds before another EM cycle is
required for the DBM transfer path.
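The 12.8 microsecond figure follows from the 40 M words/sec DBM-EM transfer rate stated elsewhere in this section, and it gives a rough measure of how little EM time the transfer path consumes. The sketch below is a back-of-envelope check under those quoted figures; the names are illustrative.

```python
WORDS_PER_EM_CYCLE = 512        # one word from each of 512 EM modules
DBM_EM_WORDS_PER_US = 40        # 40 M words/sec = 40 words per microsecond
EM_CYCLE_NS = 240               # EM module cycle time

# Time for the DBM path to drain the 512 one-word buffers.
drain_ns = WORDS_PER_EM_CYCLE * 1000 // DBM_EM_WORDS_PER_US

# Fraction of EM cycles taken by the DBM transfer while it is running.
em_fraction_used = EM_CYCLE_NS / drain_ns

print(drain_ns, em_fraction_used)
```

The drain time is 12,800 ns, so one EM cycle in roughly every 53 is taken by the DBM path, a little under 2 percent.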

Table 5-12 summarizes the characteristics of the DBM.


5.8.2 Soft Error Control

As  a background  job,  the  DBM  controller  periodically  initiates  an 
access  for  the  purpose  of  reading  the  contents  of  a block  and 
rewriting  that  same  block  with  all  detectable  errors  corrected, 
since errors are spontaneously created in CCD memories at a low
rate. These errors are apparently caused by background radiation
effects on the CCD chips, discharging the little capacitors by
temporarily ionizing the oxide. The rate of periodically
initiating access can rationally be determined only after getting
the vendor's specification. Preliminary Fairchild data indicates
that  one  should  scrub  through  the  entire  DBM  every  seven  minutes. 

At  that  rate,  this  background  access  would  be  initiated  for  a new 
block  every  55  ms.  Error  scrubbing  accesses  will  not  queue.  If 
one  is  delayed  beyond  its  55  ms  time  slot,  then  the  whole  cycle 
will  slip  to  7 minutes  plus  55  ms. 

5.8.3 Design Rationale and Changes from Preliminary Study

The major change from Ref. 1 and Ref. 2 was the reorganization of the
internal  structure  of  the  DBM  CCD  storage  array  to  allow  higher 
bandwidths  to  and  from  the  EM  modules  and  to  and  from  the  file 
storage  system. 




Table 5-12. Data Base Memory (DBM) Characteristics


Number in System: 1
Function

To serve as staging area for FMP jobs; to serve as memory
extension for FMP jobs that will not fit into EM and PMs.

Mode of Operation

Stores in blocks only. Has access to support processor file
system on the one side, and to the EM on the other side. DBM
areas may be used by the file system.


Storage Capacities

134,217,728 words         plus   131,072 words
400 ns cycle shift rate          280 ns cycle
256k-bit CCD technology          64k-bit dynamic MOS RAM technology

Connectivities

No.   To/From             Function or Name   Signals   Timing            Comments

      Support Processor   Data channel       TBD       40 megabits/sec
      EM                  Data channel       TBD       40 megabits/sec
      DBM controller      Control            TBD       TBD

Reliability/Repairability/Trustworthiness

All words covered by error correcting code.

Errors are periodically removed by reading, doing error
correction, and rewriting.

Sections of DBM can be locked out by software, so that
function can be provided by the remaining working portions.

Physical

Projected chip count: 29160 (28160 memory + 1000 control
and misc.)

Size: 176 boards of 166 chips each

Power: 10 kW operating, 1 kW standby


Two major data transfer paths exist, one to the EM and one to the
disks of the File System. The desired transfer rate to and from
the Extended Memory (EM) is 40 M words/sec. To accomplish this,
the DBM storage area will be organized 440 chips wide, for parallel
emission of eight 55-bit words, by 64 chips deep.

The natural block size, with 2,048 bits in each shift register and
eight words in parallel delivering a block of 16,384 words, is
adopted. There are 8k blocks for a total of 134,217,728 words.
Error correction is a SECDED, probably the modified
Hamming-plus-parity implemented by Motorola's 10163 chip.
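The block size, block count, and total capacity follow from the chip organization assumed in this section (128 shift registers of 2,048 bits per chip, shifted at 2.5 MHz). The sketch below recomputes them; it restates figures from the text, and the names are illustrative.

```python
SHIFT_REGISTER_BITS = 2048
REGISTERS_PER_CHIP = 128
SHIFT_RATE_HZ = 2_500_000
CHIPS_WIDE = 440
CHIPS_DEEP = 64
WORDS_IN_PARALLEL = 8
WORD_BITS = 55

# 440 chips wide is exactly eight 55-bit words side by side.
assert CHIPS_WIDE == WORDS_IN_PARALLEL * WORD_BITS

chip_bits = REGISTERS_PER_CHIP * SHIFT_REGISTER_BITS      # 256k bits
block_words = SHIFT_REGISTER_BITS * WORDS_IN_PARALLEL     # words per block
block_count = CHIPS_DEEP * REGISTERS_PER_CHIP             # "8k blocks"
total_words = block_words * block_count
memory_chips = CHIPS_WIDE * CHIPS_DEEP

# One full pass of a 2,048-bit shift register at the 2.5 MHz shift rate.
block_transfer_ms = SHIFT_REGISTER_BITS / SHIFT_RATE_HZ * 1000

print(block_words, block_count, total_words, memory_chips, block_transfer_ms)
```

The results agree with the 16,384-word block, the 134,217,728-word total, and the 28,160 memory chips listed in Table 5-12, and the block transfer time comes to about 0.82 ms.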

Since the array of CCD chips is 64 x 440, the DBM is constructed
in a number of physical modules, each one 8 x 440 chips. Cards
are 20 bits wide, 22 cards per module. The repair philosophy is
to pull and replace individual cards, and the degraded mode of
operation would be to run with one or more modules missing, and
the operating system would have to be told to avoid assigning any
data to that space.

There are eight block-sized buffers, which stand between the CCD
storage and the host interface, in order to reduce the
interference with DBM-EM transfers produced by simultaneous
DBM-file system transfers. They also serve as timing buffers to
the file system's disk packs, eliminating the need for block sized
buffers elsewhere in the data channel. These buffers are
contained in two memory modules constructed of the 64k-bit dynamic
RAM chips used in the EM modules.

After the transfer of a block to or from the CCD store, the shift
registers rest at the starting position until shifting is required
by the refresh requirements, or until the CCD store is again
addressed, whichever occurs first. The store will be periodically
addressed for error control reasons; see 5.8.2 below. Therefore,
whenever there are several requests for transfer pending at once,
or when they occur with sufficient frequency, the access time is
essentially zero to the first word of the block. For transfers
arriving at random times, far enough apart in time so as not to
interfere, the average access time is given by:

Tav = 1/2 (Tb^2 / Tr)

where Tb is the transfer time of a single block (0.82 ms) and Tr
is the time between refreshes. Tr will be in the specification of
the device, and is expected to lie between 1 ms and 10 ms.
Therefore, the average access time for random data at low usage, to
the first word of the block, has an upper bound which is expected to
lie between 0.67 ms and 0.067 ms. As traffic increases, the
access time is mostly due to interference between competing
accesses, while the contribution due to delay in the memory goes to
zero.
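The access-time expression can be evaluated at the two ends of the expected refresh interval. The sketch below reads the expression as one half of Tb squared over the refresh interval, which is an interpretation of the printed formula; the symbol name for the refresh interval and the function name are chosen here for illustration.

```python
def average_access_ms(tb_ms: float, tr_ms: float) -> float:
    """Average wait for the first word of a block at low usage.

    Reading of the formula: a fraction tb/tr of the time is spent in a
    refresh pass, and a request arriving during a pass waits tb/2 on
    average, giving 0.5 * tb**2 / tr.
    """
    return 0.5 * tb_ms ** 2 / tr_ms


TB_MS = 0.82  # transfer time of a single block
for tr_ms in (1.0, 10.0):
    avg = average_access_ms(TB_MS, tr_ms)
    print(tr_ms, avg, 2 * avg)
```

Under this reading the average is about 0.34 ms for a 1 ms refresh interval and about 0.034 ms for 10 ms. The quoted bounds of 0.67 ms and 0.067 ms correspond to Tb squared over the refresh interval, twice the average, which is consistent with their being described as an upper bound.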
The DBM design seen here is the result of comparing a number of
different devices. The other possibilities include:

Magnetic bubbles. Rejected because the bandwidths would
require  the  reading  and  writing  of  thousands  of  bubble  chips 
in  parallel,  and  also  because  of  the  inherently  greater 
complexity  of  bubble  systems.  Each  bubble  chip  requires 
several  support  chips  such  as  drivers,  sense  amplifier,  etc. 

Rotating magnetic storage. With enough heads in parallel,
and a fast enough rotation rate, magnetic rotating storage
can supply the DBM requirements. However, the programming
becomes complicated by considerations of data organization
and access time. Blocks want to be very large to amortize
the large access times over the high transfer rate
requirements. For example, to get full transfer rate from a
10 ms disk requires blocks that cover the entire track, or
blocks 10 ms long. If full transfer rate is 40 million words
per second, the blocks are almost half a million words each.
For some purposes this is a severe restriction.

64k-bit CCDs. 64k-bit dynamic RAMs will be preferred by
almost all equipment designers over the shift register CCDs.
With the recent appearance on the market of dynamic RAMs, it
is to be expected that the 64k-bit CCDs will disappear.

64k-bit dynamic RAMs. These would make an acceptable back-up
DBM  design.  With  the  increased  cost  would  come  a measure  of 
increased  performance  and  freedom  from  the  hardware-defined 
block  structure. 

One last possibility should be mentioned for the future. The same
device fabrication, tooling and lithography techniques which are
expected  to  allow  the  development  of  256k-bit  CCD  chips  can  be 
expected  to  result  in  256k-bit  dynamic  MOS  RAM  chips  within  a year 
after  the  CCD  chips  are  available.  Enough  advantages  may  accrue 
from  the  use  of  these  chips  in  terms  of  increased  performance  and 
freedom  from  a fixed,  hardware-defined  block  structure  that  these 
RAM  chips  would  be  used  in  the  DBM  design. 

5.9  DATA  BASE  MEMORY  (DBM)  CONTROLLER 

The  DBM  controller  interfaces  two  environments,  the  FMP  internal 
environment  and  the  file  system,  since  the  DBM  is  the  window  in 
the  computational  envelope.  DBM  allocation  is  under  the  control 
of  the  file  management  function  of  the  support  processor.  The  DBM 
controller  has  a table  of  that  allocation,  which  allows  the  DBM 
controller  to  convert  names  of  files  into  DBM  addresses.  When  the 
file has been opened by an FMP program, it is frozen as far as
allocation is concerned, and must remain resident in DBM until
either  closed  or  abandoned.  For  open  files,  the  DBM  controller 
accepts  descriptors  from  the  coordinator  which  call  for  transfers 
between  DBM  and  EM.  These  descriptors  contain  absolute  EM 
addresses,  but  file  names  and  record  numbers  for  the  DBM  contents. 

The DBM controller therefore has two main elements: first, a
programmable controller and second, hardwired channel logic to
accommodate the data transfers.

The software response time of the DBM controller shall be less
than 100 microseconds to Coordinator requests. This demands that
the conversion from file name to address be simple table lookup,
and also that the response of the DBM controller to Coordinator
commands be essentially instantaneous; i.e., either the normal
state of the DBM controller is waiting for a Coordinator command,
or Coordinator commands have a priority interrupt within the DBM
controller.

The actual channel controls for transferring a block of data are
independent of the controller that does the table lookup and
handles the exception conditions. Address counters, limit
registers, and limit comparators are separately implemented, not
programmed, because of the high transfer rates involved. There
are five such channel controls, one per host channel, and one for
the EM interface. The entire bandwidth of the EM channel is
devoted to whatever single transfer is being effected at a given
time.

Operation  is  as  follows.  When  an  FMP  task  has  been  requested,  the 
support  processor  passes  to  the  file  manager  the  names  of  the 
files  needed  to  start  that  task.  In  some  cases  existing  files  are 
copied  into  newly  named  files  for  the  task.  When  all  files  have 
been  moved  into  DBM,  the  task  starts  in  the  FMP.  When  the  task  in 
the  FMP  opens  any  of  these  files,  the  allocation  will  be  frozen 
within DBM. It is expected that "typical" task execution will
start by opening all necessary files. During the running of an FMP
task, other file operations may be requested by the user program
on the FMP, such as creating new files and closing files.

EM  space  is  allocated  either  at  compile  time  or  dynamically  during 
the  run.  In  either  case,  EM  addresses  are  known  to  the  user 
program.  DBM  space,  on  the  other  hand,  is  allocated  by  the  file 
manager,  which  gives  a map  of  DBM  space  to  the  DBM  controller.  In 
asking  the  DBM  controller  to  pass  a certain  amount  of  data  from 

DBM  to  EM,  the  Coordinator,  as  part  of  the  user  program,  issues  a 
descriptor  to  the  DBM  controller  which  contains  the  name  of  the 
DBM  area,  the  absolute  address  of  the  EM  area,  and  the  size.  The 
DBM  controller  changes  the  name  to  an  address  in  DBM.  If  that 
name  does  not  correspond  to  an  address  in  DBM,  an  interrupt  goes 
back  to  the  Coordinator,  together  with  a result  descriptor 
describing  the  status  of  the  failed  attempt. 
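
The descriptor handling just described can be sketched as a table lookup. The following Python fragment is an illustration only, not the controller's actual design: the class names, field names, status strings, and the size check are all assumptions introduced for the example; the source specifies only that the descriptor carries a DBM area name, an absolute EM address, and a size, and that an unknown name produces an interrupt with a result descriptor.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Descriptor:
    """Request issued by the Coordinator (field names are hypothetical)."""
    dbm_name: str      # name of the DBM area
    em_address: int    # absolute address of the EM area
    size: int          # amount of data to transfer


@dataclass(frozen=True)
class ResultDescriptor:
    """Status returned to the Coordinator (layout is hypothetical)."""
    ok: bool
    status: str
    dbm_address: int = -1


class DbmController:
    def __init__(self, dbm_map: Dict[str, Tuple[int, int]]):
        # Map supplied by the file manager: name -> (DBM base address, length).
        self.dbm_map = dbm_map

    def request(self, d: Descriptor) -> ResultDescriptor:
        entry = self.dbm_map.get(d.dbm_name)      # one table lookup
        if entry is None:
            # Name has no DBM address: report failure to the Coordinator.
            return ResultDescriptor(False, "NAME_NOT_IN_DBM")
        base, length = entry
        if d.size > length:
            # Assumed extra check, not stated in the text.
            return ResultDescriptor(False, "SIZE_EXCEEDS_AREA")
        return ResultDescriptor(True, "TRANSFER_STARTED", base)


# Hypothetical usage: one file resident in DBM, one request that fails.
controller = DbmController({"GRID01": (0x4000, 1024)})
good = controller.request(Descriptor("GRID01", em_address=0x100, size=512))
bad = controller.request(Descriptor("NOFILE", em_address=0x100, size=512))
```

A dictionary lookup stands in here for the "simple table lookup" that the response-time requirement calls for.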


Not all files will wait until the end of an FMP run to be unloaded.
For example, the number of snapshot dumps required may be data
dependent, so we may wish to create a new file for each one, and
certainly we shall want to close the file containing a snapshot
dump so that the file manager can unload it from DBM. When the FMP
task terminates normally, all files that should be saved will have
been closed by the FMP program. The strategy that supports
restart has not been detailed.

The  file  manager  may  choose  to  leave  read-only  files  in  place  in 
DBM,  on  the  chance  that  the  same  read-only  file  may  be  asked  for 
by  more  than  one  task. 

5.10  DIAGNOSTIC  CONTROLLER  (DC) 

The diagnostic controller provides a channel whereby the Support
Processor, or a logician at the maintenance panel, can impose
diagnostics upon the FMP. The strategy behind the diagnostics is
that any portion of the FMP can be set to some arbitrary state,
and then caused to execute some fixed function or execute for some
fixed amount of time, and that the resulting state can be
observed. The Diagnostic Controller's access is direct to the
coordinator. Access to the processors is indirect, in that the
coordinator has direct access to the processors, and the
diagnostic controller manipulates the coordinator. Chapter 5, in
discussing the diagnostic programming, discusses these relationships
in more detail.

The  output  of  the  diagnostic  controller  is  a set  of  commands  to 
the  Coordinator  and  the  DBM  controller.  These  commands  are  yet  to 
be  determined  in  detail  but  they  are  of  the  general  type  of  the 
following  examples: 


LOAD  REGISTER  R 
READ  REGISTER  R 

EXECUTE  the  instruction  presently  residing  in  the  program 
register  and  then  halt 

HALT  all  operations,  possibly  by  suspending  the  clock 

INITIALIZE a predetermined subset of registers to a
predetermined state (probably all zeroes)
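
The "set a state, execute, observe the result" strategy behind these commands can be sketched with a toy model. Everything in this Python fragment is hypothetical: the class `DiagnosedUnit`, the register names, and the one-opcode instruction format are inventions for illustration, since the actual command set is stated above to be undetermined.

```python
class DiagnosedUnit:
    """Toy stand-in for a unit driven by diagnostic commands."""

    NOP = ("NOP", None)

    def __init__(self, register_names):
        self.registers = {name: 0 for name in register_names}
        self.registers["PROGRAM"] = self.NOP   # hypothetical program register
        self.halted = True

    def load_register(self, name, value):      # LOAD REGISTER R
        self.registers[name] = value

    def read_register(self, name):             # READ REGISTER R
        return self.registers[name]

    def execute(self):                         # EXECUTE one instruction, then halt
        self.halted = False
        opcode, operand = self.registers["PROGRAM"]
        if opcode == "INC":                    # the only toy opcode modelled
            self.registers[operand] += 1
        self.halted = True

    def halt(self):                            # HALT all operations
        self.halted = True

    def initialize(self, subset):              # INITIALIZE a subset of registers
        for name in subset:
            self.registers[name] = self.NOP if name == "PROGRAM" else 0


# Impose a state, run one fixed function, observe the resulting state.
unit = DiagnosedUnit(["A", "B"])
unit.load_register("A", 41)
unit.load_register("PROGRAM", ("INC", "A"))
unit.execute()
observed = unit.read_register("A")
```

A preprogrammed diagnostic would be a stored sequence of such calls together with the expected value of each observed register.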

The input of the diagnostic controller comes from either the
support processor or from a maintenance terminal. The input can
cause the diagnostic controller to emit single commands, or to
emit a series of preprogrammed commands. In order to emit
meaningful sequences, and to collect the results of those sequences,
it is envisioned that the diagnostic controller contains a mini or
microprocessor. A test control language will be provided.

The diagnostic controller is a debugging aid, a system integration
aid, and is used only as a fall-back mode of operation during
maintenance. System initialization, upon power up or other cold
start of the FMP, may also use some of the DC capabilities for
initialization of selected registers and loading of bootstraps.

5.11  POWER  CONSIDERATIONS 


The power supply design for the FMP will consider the following:

- A small number of centralized power conditioning modules
that accept raw AC power from the mains.

- Switching regulators for efficiency.

- Defense against faults in the incoming power.

- Defense against faults in the FMP.

- Noise reducing grounding methods.

- Non-volatility of DBM contents.

A power  supply  system  that  takes  all  these  features  into  account 
is  described  in  this  section. 

Total  power  for  the  FMP  is  estimated  at  250  kw,  based  on  an 
average  of  0.8w  for  each  of  the  200,000  circuit  packages  and  on  65 
percent  efficiency  in  the  power  supply  system. 
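
The arithmetic behind the 250 kw figure can be checked directly from the three numbers given. The short Python sketch below only restates that calculation; the variable names are chosen for the example.

```python
PACKAGES = 200_000            # circuit packages in the FMP
WATTS_PER_PACKAGE = 0.8       # average dissipation per package
SUPPLY_EFFICIENCY = 0.65      # overall power supply system efficiency

load_kw = PACKAGES * WATTS_PER_PACKAGE / 1000.0   # power delivered to circuits
input_kw = load_kw / SUPPLY_EFFICIENCY            # power drawn from the mains
loss_kw = input_kw - load_kw                      # dissipated in the supplies

print(f"circuit load : {load_kw:6.1f} kW")
print(f"mains input  : {input_kw:6.1f} kW")
print(f"supply losses: {loss_kw:6.1f} kW")
```

The circuit load comes to 160 kw and the mains input to about 246 kw, which the text rounds to 250 kw.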

5.11.1 AC Modules


The block diagram of the power supply system is shown in Figure
5.15. Raw AC power is supplied to six places (labelled "1" in the
figure), namely:

- The  maintenance  panel,  which  also  contains  the  central 
power  system  control 

- The  DBM  power  system 

- Four  identical  AC  modules. 

Each  of  the  six  places  to  which  raw  AC  input  is  supplied  contains 
an  AC  voltage  monitor.  The  design  intent  is  to  shut  the  machine 
down  for  high  line  or  low  line  that  is  potentially  damaging  to  the 
machine,  and  to  send  a one-bit  message  to  the  maintenance  panel 
and  the  support  processor  for  low-line  conditions  that  may  cause 
garbling of data.

Out of the maintenance panel's power system come various DC
voltages (labelled "3") for the maintenance display and the
central power control. These include +5 at 20 amps for logic and
LED drivers, +12 at 0.5a, and -12 at 1a, plus a switched 115v
AC for the CRT, which has its own self-contained supply.

The AC modules receive "turn-on" and "turn-off" signals from the
central power system control, and send "fault" signals back to the
central control. Each AC module supplies up to 250 amps at 158 volts
DC (labelled "2") for the switching regulators attached to it.

The AC module also combines the fault signals from its attached
power supplies into a "cabinet fault" line, and shuts down for any
perceived faults. It contains line filters. In complexity, it is
similar to the AC module of the B-6700. Power efficiency is
between 96 percent and 99 percent.

The requirement that the FMP power system ride through
undervoltage transients, and tolerate voltage spikes from the
mains, influences the design of the power control modules. A
transformerless rectifier in the central power control module,
with switching regulators distributed around the FMP, is a system
inherently tolerant to undervoltage sags and transients, and
impervious to spikes. In addition, the switching transients of
the regulators tend to be soaked up by the filter capacitors at
the control module's rectifier. Whether a motor-generator
set or battery backup is needed would depend on actual line
characteristics at Ames. If the line characteristics are known
before the design is carried out, the system can be designed so
that the expense and inefficiency of the motor-generator set can
be eliminated.

The DBM power unit provides 30 amps at 158v for the DBM controller
logic supplies (labelled "5"), and a separate line ("4") for 35
amps at 158v for the memory chips of the DBM. There is also a
stand-by mode in which 8 amps at 158v is supplied to the memory
cabinet from batteries during power outages of up to 15 minutes.
(15 minutes is selected on the basis that it is long enough to
save all of DBM on a single disk pack through a single disk
channel. The resulting 316 watt-hour requirement can possibly be
supplied by an ordinary sealed lead-acid battery.) The DBM power
unit also contains logic to handle fault situations, and the same
line filters that are in the AC modules.
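
The 316 watt-hour figure follows from the stand-by current, the bus voltage, and the outage duration. The Python sketch below simply reproduces that arithmetic; the variable names are chosen for the example.

```python
STANDBY_AMPS = 8.0        # stand-by current to the memory cabinet
BUS_VOLTS = 158.0         # DC bus voltage
OUTAGE_MINUTES = 15.0     # longest outage to be ridden through

standby_watts = STANDBY_AMPS * BUS_VOLTS                  # 8 x 158
energy_wh = standby_watts * OUTAGE_MINUTES / 60.0         # watts x hours

print(f"stand-by power : {standby_watts:.0f} W")
print(f"battery energy : {energy_wh:.0f} Wh")
```

Stand-by power is 1264 watts, and a quarter of an hour at that level is 316 watt-hours, in agreement with the text.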

5.11.2 Other Power Supplies

Besides  the  seven  units  described  briefly  above,  there  are  within 
the  cabinets  the  following: 

- 516 processor power supplies, each contained physically
within its own processor. Each one is a 70 percent or
better efficient switching regulator supplying 40 amps at
5v, and 0.5 amps at -12v.

- 44 supplies at 5v, 160 amps. Except for the lower power
levels these are similar to power supplies built by
Burroughs for PEPE. There are eight in each EM cabinet,
four in the Coordinator-connection-network cabinet,
and eight each in the DBM controller and the DBM memory
cabinet. These are switching supplies at 75 percent
efficiency.

- 6 supplies at 12v, 2 amps, also working from the 158v out of
the AC modules. These are used in Coordinator, DBM
controller, and EM cabinets for +12v and -12v for various
purposes.

Each of the supplies above contains remote voltage sensing,
appropriate over-current sensing, current limiting or fold-back,
over-voltage and under-voltage sensing.

5.11.3 Grounding Considerations

Grounding is an area of design in which even qualified electrical
and electronic engineers sometimes propagate false myths. Some of
the confusion is due to failing to distinguish between various
functions of the conductors called "ground", which in any given
case may or may not be at the same voltage, and may or may not be
the same conductor. Some functions are:

- Neutral  in  an  AC  distribution  system. 

- Earth, or an external zero voltage reference.

- Safety ground, enclosing the equipment in order to prevent
shock hazards.

- Shields, enclosing electrically active circuits in order to
prevent transmission or reception of interfering
electromagnetic signals.

- Reference  voltage.  The  signal  voltages  in  the  equipment  are 
measured  with  respect  to  the  reference  voltage.  Reference 
voltage  is  often  called  "logic  ground". 

- Return  paths  for  currents. 

Some  details  resulting  from  these  considerations  are: 

- The ground return from backplane to power supply is never
used as part of the path that connects one backplane ground
to another backplane ground.

- Every module has its logic ground tied to chassis, so that
there will be no floating grounds when the modules are
tested as stand-alone modules. These ties may be resistors
if unwanted ground currents would be set up by direct
connections.

- Every single-ended signal which traverses from the area of
one backplane to another is accompanied by a wire conductor
for the return current of that signal, and the return
conductor is connected to reference voltage at all points at
which the signal is either generated or used.

5.12 CIRCUIT AND PACKAGING TECHNOLOGY

5.12.1  IMPLEMENTATION  TECHNOLOGY  UPDATE 

5.12.1.1  SUMMARY 

The semiconductor industry has continued to improve both device
density and performance since the previous implementation
technology submission. Smaller device geometries have been achieved in
production with the application of Electron Beam processing
techniques. The initial utilization of the E Beam tool has been in
the mask generation area where smaller geometry and more rapidly
generated masks have been produced. This advantage, coupled with
projection exposure of wafers as compared to the use of contact
masks, and plasma or dry etching, has enabled higher precision
devices to be generated in a production environment. Line widths
are predicted to diminish to under one micron. The priority of
devices to which the new processing technology is applied has been
first in the memory area and second in the microprocessor area.
Microprocessor availability in the 16 bit logic density area has
increased from just a few to a selection of a half dozen or so,
with performance estimated to be in the PDP 11/45 class or
greater. Direct address capability has expanded from a 16 bit
limitation of 65K to a 16 megabyte level.

During  the  initial  manufacture  of  large  (65K)  CCD  Memories  a 
higher  than  expected  random  failure  rate  was  observed.  The 
failure  mechanism  was  later  identified  as  being  caused  by  alpha 
particles  which  modified  the  charge  being  transported,  thus 
destroying  the  information  stored  in  the  memory.  Solutions  were 
developed for greatly lowering this failure rate by reducing or
eliminating  the  major  source  of  alpha  particles  and  providing  a 
shield  layer  on  the  chip.  The  major  source  of  alpha  particles  was 
reported  to  be  in  material  used  to  package  the  CCD  chips. 

In the very high speed area, gate arrays were becoming an
interesting alternative for achievement of dense logic
implementation. The economy of using gate arrays is dependent on quantity
of the devices required for the systems to be produced. Basic
arrays exist at Motorola and Fairchild in the high speed ECL area.
A gate array exists in the proprietary Burroughs CML circuit
family (BCML).

Memories anticipated to be available in the 1979/1980 time frame
have already been delivered on a sample basis to selected
manufacturers. These include the 65K dynamic RAMs and 16K static
RAMs. CCD 65K bit memory circuits have been delivered for
incorporation into CCD memory modules. Work has begun in definition of
256K bit CCD and 256K bit RAM with expectations of availability in
the 1980/81 time frame.

Gallium arsenide efforts in the high speed sub-nanosecond logic
area have continued at a number of manufacturers' facilities.
Specification circuit configurations for the gallium arsenide
MESFETs are being reviewed along with development of production
procedures to manufacture these devices. Speed power estimates
vary from the 100 picosecond propagation delay range to about 300
picoseconds, with power dissipations varying from about .08 to .3
milliwatts per gate.

Effort is being expended in utilization of the CMOS SOS type of
circuit implementation. At the Solid State Circuits Conference in
1978 the general discussion seemed to indicate that CMOS SOS gate
density problems would be somewhat overcome with the tighter line
width. The attractiveness of the CMOS SOS circuit for NASF
applications is the projected lower power dissipation of gates,
not memory, in the CMOS LSI circuit.

The specific implementation approach to be selected for the NASF
FMP must be postponed as long as possible due to the dynamic
developments occurring in the semiconductor technology area. At
present, the bipolar ECL or CML candidates look the most promising
from a performance/risk point of view. Although developments in
higher speed gallium arsenide devices are progressing, the risk
involved in such an infant technology does not seem to be
warranted by the advantages gained in higher speed.

During the current contract some additional information in both
ECL arrays and BCML circuits has been reviewed. Some
characteristics of the ECL voltage compensated arrays as well as
information on BCML are included in the following.

5.12.1.2  ECL  Arrays 

Motorola has announced the MECL 10K Macrocell Array that consists
of 48 macro cells with 32 interface circuits and 28 output
circuits. All cells can have series gating. Structured cells are
predefined into logic elements. Interconnect channels are 12 x 12
for 9 macro cells. The macro cells consist of functional circuits
which are interconnected to produce larger portions of logic. The
total number of channels available for interconnection is
approximately 108 x 94. Internal gate delays anticipated are
approximately 900 picoseconds. A maximum of 1.3 nanoseconds is
expected. The maximum power dissipation if all cells are used is
anticipated to be approximately 4 watts. An equivalent gate
complexity up to 750 gates can be realized on the array. The
average gate power is projected to be 5.3 milliwatts. High drive
outputs can be achieved at 8 of the interfaces. A capability of
driving a 25 ohm line exists at these points. The die size is
approximately 210 x 230 mils. It is anticipated that the
semiconductor chips will be placed in a 68 pin leadless package.
Some proposed connectors exist for the 68 pin package.
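
The quoted average gate power is consistent with the other figures in the paragraph, as the following Python sketch shows. The delay-power product computed at the end is a derived quantity that does not appear in the source; it is included only as an illustrative figure of merit, under the assumption that the 900 picosecond typical delay and the average gate power apply to the same gate.

```python
MAX_CHIP_WATTS = 4.0          # maximum dissipation with all cells used
EQUIVALENT_GATES = 750        # equivalent gate complexity of the array
TYPICAL_DELAY_NS = 0.9        # anticipated internal gate delay

# Average power per equivalent gate, in milliwatts.
gate_power_mw = MAX_CHIP_WATTS / EQUIVALENT_GATES * 1000.0

# Derived (not in the source): delay-power product per gate.
# nanoseconds x milliwatts = picojoules.
delay_power_pj = TYPICAL_DELAY_NS * gate_power_mw

print(f"average gate power : {gate_power_mw:.1f} mW")
print(f"delay-power product: {delay_power_pj:.1f} pJ")
```

Four watts spread over 750 gates gives about 5.3 milliwatts per gate, matching the projection in the text.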

5.12.1.3 BCML




The BCML-2 (Burroughs' Current Mode Logic) family is a Burroughs
developed circuit family intended for use in Burroughs' systems.
The family consists of SSI, MSI and LSI circuit types, gate
arrays, register files, ROM, PROM, EAROM, and RAM. All devices
have on-chip voltage and temperature compensation. This assures
constant logic levels and constant threshold, hence constant noise
margins. It also assures constant propagation delay over the
entire operating voltage and temperature ranges. Two types of
power supply are specified. Logic circuits use -2.7V ± 30% or
-4.8V ± 25%, while memories use only -4.8V ± 25%. All devices
have on-chip output resistors which serve to source-terminate 50
ohm transmission lines. On-chip Test and Diagnostic (T&D)
monitors are used to detect opens and/or shorts of any logic net
and loss of power supply voltages to any circuit chip.

The salient features of the BCML family are given in Table 5.13.

5.12.2 Packaging

5.12.2.1 General

Final choices of packaging technology can be deferred until the
system design is nearly complete. However, for performance and
reliability analysis, scheduling and cost, preliminary selections
must be made. Basic high speed (ECL) packaging technology has
been developed over the past decade that provides high performance
and reliability at quite reasonable cost. The manufacturing
tools, and assembly and test procedures, are all fully developed.
This technology includes a family of specified and use qualified
components and hardware. Advances in this area are under
continual study. The current status and performance
characteristics of this technology are discussed in the following
sections.

5.12.2.2 Printed Circuit Assemblies

Multi-layered printed circuit assemblies provide a straightforward
approach for the packaging of standard commercial dual-in-line ECL
circuits. The six layer 16 inches by 18.5 inches assembly used by
Burroughs on the PEPE and later programs provided a capability of
mounting 300 sixteen pin dual-in-line packages or 280 sixteen pin
and 10 twenty four pin packages. The board consists of six copper
layers permitting two signal layers, two voltage layers and two
ground layers (Figure 5.16). Each signal layer references a
ground plane providing two layers of 50-ohm microstrip. Proper
tolerance is maintained over line width, dielectric spacing and
dielectric constant.

Table 5.13

FEATURES OF BURROUGHS CML CIRCUIT FAMILY

- High speed - 0.7ns per raw gate

- Low power product - 4pJ per internal gate for LSI, 6pJ for
MSI and 8pJ for SSI

- Fully compensated logic levels and threshold - Noise margins
and propagation delay remain constant over operating
temperature and voltage ranges

- Source terminated interconnection - On-chip output resistors
properly terminate 50 ohm transmission lines

- Complementary, simultaneous outputs - Simplifies design,
minimizes cross-talk

- Small logic swing of 440mV - Provides higher speed at lower
power, noise generation

- Constant supply current - Reduces noise, fewer or smaller
decoupling capacitors, no dependence on operating frequency

- Advanced Circuit Technique - On-chip use of series-gating, gate
stacking, EFL (Emitter-Function-Logic), Schottky-Diode gating,
wired-OR and -AND, staggered thresholds, etc. provide best
functional density at lowest power level

- Test Diagnostic pin (T&D) - Facilitates testing of individual
packages and isolates faulty packages in operating environment

- 50 pad package - Increases logic function capability per device
and reduces package count

- Multi-chip package - Increases packaging density, improves
performance and reduces system cost

- SSI to LSI densities - 50 pad package has capacity to
accommodate gate densities in the order of 1000 gates per chip

Figure 5.16 Multilayered Printed Circuit Board for ECL


Figure 5.17 illustrates the component side of a fully populated
board assembly of the PEPE type. The aluminum electrolytic
capacitors along the top and bottom edges of the assembly are
utilized to bypass voltage noise for frequencies below 1 MHz. Two
other levels of bypassing control voltage noise: the interlayer
capacity of the board for frequencies above 20 MHz, and ceramic
capacitors (contained in the terminator resistor packages) for
frequencies between 1 and 20 MHz. The board assembly is mounted
in a diecast aluminum alloy frame. Camming type handles are
mounted on the front of the frame to provide the insertion force
to mate the four 100-pin I/O connectors. The I/O connectors
incorporate a unique socket design that results in low insertion
force and low contact resistance. A single 100-pin connector
nominally requires around a 13-pound insertion force. Four
connectors would result in an insertion force of approximately 52
pounds. The handles are also used to lock the board in place.
Each circuit card module assembly is supported by shear/locating
pins in front and rear.


This assembly can accommodate cam action zero insertion force
connectors which in turn can accommodate the edge connector of
belted cable paddleboard assemblies.


The  assembly  may  be  adapted  to  mount  dual-in-line  sockets.  Each 
socket  is  soldered  to  the  board  to  pick  up  the  printed  circuit 
signal trace. In addition, a wire-wrap tail on the socket provides
for two levels of wire-wrap (Figure 5.18).

Figure 5.18 Multilayered Printed Circuit Assembly with Dual
In-Line Devices and Sockets


5.12.2.3  Interconnections 

Two primary techniques for the interconnection of the basic
assemblies (processors, memory modules, etc.) help guarantee
feasibility of the FMP. Wherever possible, interconnections will be made
with paddle board and belted cable assemblies. Belted
transmission cable with up to 70 conductors (AWG 28 or 30, silver plated)
on 0.025 inch centers, suitable for the FMP signal levels and
frequencies, is readily available. Techniques for semi-automatic
assembly of these cables to paddleboards with edge connectors are
fully developed and provide economical, reliable
interconnections.

Where the use of belted transmission line is impractical,
interconnections are achieved with subminiature 50-ohm coaxial wire.

The  coax  consists  of  No.  32  AWG,  silverplated-drain  or  ground 
conductor;  a wrapped  tape  shield  of  aluminized  mylar;  and  an  outer 
jacket  of  laminated  mylar.  The  maximum  overall  size  of  the  cable 
is  0.033  inch  x 0.043  inch.  The  drain  conductor  is  compressed 
between  the  aluminum  side  of  the  shield  and  the  primary  insulation 
such  that  the  drain  wire  is  in  contact  with  the  shield  along  the 
full  length  of  the  cable.  Both  conductors  (ground  and  signal)  are 
wrapped  simultaneously  on  adjacent  pins  (on  0.100  inch  centers) 
using  a dual-bit  wire-wrap  gun  as  shown  in  Figure  5.19. 


5.12.2.4 Backplanes

Backplanes for power distribution are not required for the
processors as they have individual power supplies. However, in the case
of the coordinator and connection network it may be more desirable
to have a centralized power source, which for high speed ECL
technology would normally require a laminated backplane assembly.

This assembly consists of three layers of epoxy-coated copper. It
serves the dual functions of: 1) mounting the female half of the
circuit card module assembly connectors, and 2) efficiently
distributing power to each circuit card module assembly by providing a
low impedance power distribution network.

Power  is  distributed  to  each  circuit  card  module  assembly  via  pins 
soldered  to  the  individual  backplane  layers  as  shown  in  Figure 
5.19.  A wire  wrap  connection  is  then  implemented  between  the 
backplane  and  associated  connector  pins.  Multilayer,  laminated 
backplanes  are  required  to  minimize  backplane  impedance  (primarily 
inductive).  A low  inductance  offers  a low  impedance  to  surge 
currents,  guarantees  power  supply  stability,  and  gives  fast  power 
supply  response  time. 

5.12.2.5  Cabinet  Frame  Assembly  and  Doors 

At this time, it is anticipated that the FMP equipment would be
housed in cabinets similar to those used on other advanced
processor systems currently being made by Burroughs. A description of
these assemblies is provided in the following.

The  cabinet  frame  is  constructed  of  0.120-inch-thick  rectangular 
steel  tubing  welded  into  a unitized  frame.  In  certain  areas  the 
rectangular  steel  tubing  is  increased  in  thickness  to  0.180  inch 
for  strength  considerations.  The  overall  dimensions  of  the  basic 
weldment  are  typically  81  inches  high  by  up  to  72  inches  wide  by 
at  least  30  inches  deep.  Maximum  envelope  dimensions  of  the 
cabinet  assembly,  including  all  doors  and  end  panels  are  81  inches 
high  by  100  inches  wide  by  32  inches  deep. 

Bi-fold  doors  are  utilized  on  the  front  and  rear  faces  of  the 
cabinet.  Each  bi-fold  assembly  (there  are  four)  is  composed  of 
two  0.75-inch-thick  aluminum  honeycomb  panels  connected  by  a 
unique,  extruded,  continuous  hinge.  The  stationary  panel  on  the 
right-hand  end  of  the  cabinet  is  constructed  of  0.062-inch-thick 
formed  aluminum.  A hinged  split  door  configuration  is  utilized  on 
the  end  of  the  cabinet  to  provide  access  to  the  rear.  The  overall 
thickness of the split door is 2.13 inches. Each door section is
composed of 1-inch-thick aluminum honeycomb and 0.062-inch-thick
formed aluminum.


5.12.2.6 BCML Packaging

5.12.2.6.1 General. A complete family of packaging hardware has
been specifically developed for the Burroughs CML circuit family.
This advanced hardware family incorporates features to accommodate
subnanosecond high density circuits of greater than 1000 gates
each for use in commercial state of the art computer systems. The
family includes low cost modular liquid cooling and power
distribution systems. The design concepts placed high consideration on
manufacturability and ease of assembly, debugging and maintenance.

The BCML packaging system provides hardware that can be used
across the Burroughs product lines of computer systems and for
other special applications. The basic philosophy of this
packaging system was to partition the second level packaging to be
compatible with functional logic partitioning. By packaging a
system function within an integral unit, the number of I/O's
between units is minimized and critical functions can usually be
restricted within this unit.

5.12.2.6.2 Circuit Packaging. The basic partitioning size
selected for the BCML system is a printed wiring board 14" x 21".
This unit is referred to as an island and can accommodate 10,000
logic gates with the current normal mix of SSI, MSI, and LSI BCML
parts. With the increased usage of LSI and VLSI circuits, island
gate capacity will be enhanced.

Another  basic  goal  of  the  BCML  packaging  system  is  to  provide  for 
ease  of  field  maintainability.  The  following  are  some  of  the 
packaging as well as circuit features that facilitate
serviceability:

1.  Plug-in logic packages.

2.  A probe system to allow simultaneous contact of all logic
package pins.

3.  Provision  for  in-place  testing  of  circuits. 

4.  No  external  components  in  wiring  nets. 

5.  Test  and  Diagnostic  pin  (T&D)  incorporated  on  logic 
packages. 

The  first  level  of  packaging  was  selected  to  accommodate  a circuit 
family  aimed  at  high  gate  densities.  Two  package  sizes,  25  pins 
and  51  pins,  are  utilized.  Multi-chip  versions  of  the  51  pin 
package  can  accommodate  up  to  3 I.C.  chips. 


The packages themselves are a leadless hermetic ceramic
construction. The package has gold plated contacts on 50 mil centers in
two rows on its edges. The package also has an integral metal
heat sink plate. This member conducts heat generated by the
circuits to a liquid cooled frame and also serves as a low
inductance ground connection.

Two 25 pad packages or one 51 pad package mates with a 50 pin
connector. This connector will also accept two 24-signal I/O
cables or one 50-signal I/O cable. The interfaces of all the
pressure contact systems are gold plated for high reliability.
Two types of connectors are available; the first type is soldered
to the interconnecting printed circuit board and has a wire
wrappable tail, while the second makes a pressure contact to a gold
pad on the printed circuit board. These two styles of connectors
provide flexibility in design of the island interconnect media.

There are 108 connectors mounted on the logic island as well as a
liquid cooled frame. The cold frame also serves as a low
resistance ground return path. Interconnection of circuits on an
island is accomplished by a combination of P.C. lines and open
wire. A multi-layer board with internal voltage and ground planes
and two external signal layers with 50 ohm lines is used for the
bulk of the interconnections. The shorter lines can be
implemented by automatic Gardner-Denver wiring with no performance
penalty. An all wired utility board system utilizing controlled
impedance twin lead and open wire is available for prototype and
limited production systems. Higher density and lower cost P.C.
interconnection systems are being developed for both the solder
tail and double contact connectors.

Islands are interconnected with a high quality 50 ohm transmission
belt (24 or 50 signals). Since a cable interfaces with the same
socket as a logic package, the ratio of I/O pins to logic
positions is not fixed by the hardware, but is established by the
logic design. This flexibility provides for efficient island
utilization. Figure 5.20 shows an island assembly mounted in a
module with belted cable interconnections.

5.12.2.6.3  Frame, Cooling & Power: In addition to a standard
logic family, island, and interconnecting belts, the BCML
packaging system also provides a mounting structure, cooling system,
and power system for a 10 island module. This 10 island module
can be used individually for smaller systems or can be stacked 2
and 3 high for larger systems. The 50 ohm belted cables provide
module to module interconnections.


The module assembly enables the islands to fold out, permitting
front and rear access and thus facilitating testing and maintenance.
This feature is illustrated in Figure 5.20.

The cooling system, which can dissipate a 3.6 KW heat load,
consists of cold frames mounted on the islands, a circulating pump,
fans, and a liquid to air heat exchanger. Air for the cooling
system is drawn from the computer room. For highly reliable
operation, junction temperatures are restricted to 80°C with a 40°C
ambient. Much higher power (or lower junction temperature) could
be obtained by using a liquid to liquid heat exchanger with a
chilled coolant circulated through the island. This system does
not require air circulation in the computer room, with heat being
dissipated directly to the building chilled water supply.

The BCML power system is designed to be driven by an M-G set or an
equivalent line isolator. Large systems may be operated from a
site M-G set, but a 20 KVA M-G set has been packaged in a sound
proof cabinet for installation in the computer room for use with
small to medium systems.

The power supply itself is a very simple and reliable design,
consisting of only a transformer and rectifiers. Output is -2.7V
± 30% and -4.8V ± 25%. Final regulation is provided by circuitry
on the logic chip. This on-chip regulation produces a constant
current load. Therefore, voltage decoupling capacitors are not
required on the P.C. board.

5.13  IMPLEMENTATION  TOOLS 

Burroughs Corporation has a Central Design Assistance (CDA)
Department which is charged with the responsibility of developing and
maintaining a comprehensive set of tools to aid in the design,
manufacture, and maintenance of computer systems. These tools are
then adapted as needed and used by the various design and
manufacturing groups.

The design of a complex system such as the FMP requires the use
of such tools. The Design Assistance System (DAS) and the
Burroughs Interactive Logic Design (BUILD) program are examples of
aids used during design. Specifically, the DAS programs provide
assistance in the development of manufacturing tooling from a
detailed logic design. The areas supported are:

Logic partitioning
Component placement
Printed circuit routing
Wire wrap routing
Logic simulation
Test generation
Logic schematic generation
Rules check
Numeric control generation

In addition, design data is maintained in a centralized data base
to ensure design integrity.

The Burroughs Interactive Logic Design (BUILD) program allows a
design engineer to hierarchically specify a logic design, and to
verify its correctness using functional simulation techniques.
After logic verification, netlists are generated from the logic
specification and entered into the DAS engineering data base for
physical implementation.

Figure 5.21 depicts these two systems as they would be used by the
NASF project.

Figure 5.21 NASF Hardware Design and Implementation Support System


CHAPTER  6 




TRUSTWORTHINESS  AND  AVAILABILITY 


6.1  TRUSTWORTHINESS,  AVAILABILITY,  AND  ERROR  CONTROL 

6.1.1  General  Requirements 

As  the  introduction  to  Chapter  5 has  already  emphasized,  the  FMP 
has  certain  requirements  for  trustworthiness,  availability,  and 
error  control.  Among  these  basic  requirements  are: 

- System availability of 90% or better, implying an FMP
availability of approximately 95% or better, for 20 hours a
day,

- Mean time between aborts visible to the user of over 10
hours,

- Probability of apparently successful but wrong runs much
lower than the probability of an abort.


In  order  to  satisfy  the  above  requirements,  a number  of  features 
are  built  into  the  design,  including: 

- Spare processors and extended memory (EM) modules, with
software-controlled reconfiguration

- Duplexed operation with comparison of results

- Error detection and error correction on all memories

- "Scrubbing" through CCD memory and dynamic RAM memory to
find and correct any spontaneously occurring errors within
them

- Fault detection within logic circuitry (processor,
coordinator, etc.)

- Software-controlled  restart  following  a program  abort 

- Logging  of  all  errors,  analysis  of  the  logs 

- Testing  of  invariants  in  the  computation 

- The  ability  to  observe  externally  the  state  of  the  FMP 

- A system  of  diagnostics  and  confidence  checks 

- Error detection in file system, both storage and transfer
paths

These features are implemented by a combination of hardware and
software.

The trustworthiness of computation on the NASF is the combined
result of a series of influences, including

- System  software 

- Hardware  reliability 

- Hardware  error  detection 

- Completeness of the confidence and diagnostic checks

- Applications  programming  characteristics 

- Accuracy  of  failure  identification 

- Thoroughness of checks for software errors

6.1.2  Design Requirements

Additional characteristics can be derived from the basic
requirements of the previous section. These characteristics were derived
in Reference 5, and can be summarized as follows:

- Less than 1 bit in 10^*7 in undetected error from processor
memory

- Less than 1 bit in 10^5 with error detected but uncorrectable
from processor memory

- Less than 1 bit in 10^^ in undetected error from EM

- Less than 1 bit in 10^3 with error detected but uncorrectable
from EM

- Less than 1 bit in 10^3 bits refreshed in DBM shall have an
undetected error

The  derivations  were  based  on  observations  on  how  many  bits  were 
accessed  from  memory  and  from  extended  memory  during  the  typical 
15-minute  run,  and  on  an  assumed  time  of  residency  in  DBM  that 
might  be  as  long  as  a day. 

6.1.3  Sparing  and  Duplex  Processing 

Every processor cabinet has 129 processor slots; every EM cabinet
has 132 EM module slots. In the coordinator, there are four
registers, one per processor cabinet, that designate the spare
processor in that cabinet. Spare EM modules are designated by
registers in the CN buffer of every processor. The coordinator
broadcasts the designation of the spare to the CN buffer, using the
BDCST instruction, and follows that with a PILLR command to load
these registers. Thus the designation of which modules are spare is
changed by software resident on the coordinator.

Duplex processing has been proposed as a means of providing
dynamic, run-time checking of processors by comparing the results
of the same set of computations performed in two different
processors. Two approaches were considered and are discussed
below.

The spare processor designation is used to provide a duplex mode
of operation. First, one must make sure that there are 516 good
processors in the FMP. Second, processor #128 is designated
"spare" in each cabinet. This makes programmatic processor
numbers 0 through 127 fall on physical processors 0 through
127. Third, the program is run. Fourth, processor #0 is
designated "spare" in each cabinet. This makes programmatic
processor numbers 0 through 127 fall on physical processors 1
through 128, so that every computation in the run will fall into a
different processor than the first time. Fifth, the program is
run again. Sixth, the results of the second run are compared for
the expected match with the results of the first run.
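
The renumbering in the six steps above can be sketched as follows.
This is a minimal illustration only: the function name and the
"skip over the spare slot" rule are assumptions chosen to reproduce
the two cases the text states (spare at #128 and spare at #0), not
a description of the coordinator's actual logic.

```python
SLOTS_PER_CABINET = 129      # from the text: 129 processor slots per cabinet
ACTIVE_PER_CABINET = 128     # programmatic processor numbers 0..127


def physical_slot(logical: int, spare: int) -> int:
    """Map a programmatic processor number to a physical slot.

    Assumed rule: active processors fill the slots in order,
    skipping the single slot designated as spare.
    """
    if not 0 <= logical < ACTIVE_PER_CABINET:
        raise ValueError("logical processor number must be 0..127")
    if not 0 <= spare < SLOTS_PER_CABINET:
        raise ValueError("spare slot must be 0..128")
    return logical if logical < spare else logical + 1


# First run: slot 128 is spare, so logical n sits on physical n.
first_run = [physical_slot(n, spare=128) for n in range(ACTIVE_PER_CABINET)]
# Second run: slot 0 is spare, so logical n sits on physical n + 1.
second_run = [physical_slot(n, spare=0) for n in range(ACTIVE_PER_CABINET)]

# Every computation lands on a different physical processor the second time.
moved = all(a != b for a, b in zip(first_run, second_run))
```

Under this rule no programmatic processor occupies the same slot in
both runs, which is the property the duplex comparison relies on.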

Another form of duplexed processing was considered during the
course of the study. Here the duplex mode would be implemented
through some additional hardware. The 512 processors would be
divided into 256 sets of 2 each. The application program would be
compiled and run as if only 256 processors were available. Each
set of 2 processors would execute the same code on the same data,
and the results of each would be compared. The operation of the
CN is such that continuous synchronization between the two members
of the pair would require hardware means beyond those described
in Chapter 5, such as making both processors use the CN buffer of
one of them. A hardware comparator would monitor the performance,
and errors would be detected as soon as the outputs of the two
processors to the CN or to the coordinator fail to match. Because
of the synchronization problems, and because there seems to be no
real advantage of this scheme over the purely software duplexed
computation described first, the hardware comparator has not been
included in the design.

6.1.4  Error Correction in Memories

All  memory  has  error  detection  and  error  correction  in  order  to 
achieve  the  very  low  error  rates  of  the  requirements.  Error 
detection  is  a necessary  part  of  hard  failure  detection.  Error 
correction  is  proposed  based  on  expected  memory  error  rates 
between  1 bit  in  10^  and  1 bit  in  10^^. 

For  processor  memory  and  extended  memory,  a SECDED  (single  error 
correction,  double  error  detection)  code  is  proposed.  The  actual 
error  rates  in  the  chips  would  have  to  be  very  good  indeed  before 
simple  parity  plus  retry  would  provide  adequate  correction.  The 
actual  error  rates  would  have  to  be  very  bad  (worse  than  1 bit  in 
10^)  before  simple  SECDED  was  not  good  enough. 


For Data Base Memory (DBM), a higher intrinsic error rate is
expected from the chips, since the geometries on the chips are
smaller, and since more refreshes occur per access. Also, a
higher standard of performance is required, since any given datum
will go through many read-write restorations during the lifetime
of the data in DBM. As the computations in Ref. 5 show, we expect
that the same simple SECDED will also be adequate error correction
in DBM. However, the safety factor is substantially less, and a
reevaluation of this choice should be made when the soft failure
rate of the 256K CCD chips becomes known.

In the DBM it is also necessary to periodically read each word,
make necessary corrections if possible, and write it back in, in
order to keep the probability of multiple errors low enough. This
process is called "scrubbing" and is expected to be designed into
any memory system requiring it. Therefore, the DBM will not
require any external controls for scrubbing. The errors removed
by "scrubbing" are called "soft errors". This term, soft errors,
implies failures where the contents of the storage cell have been
modified in some unexpected or unplanned way (such as by the
effect of background radiation), but which are not the permanent
inability of a storage cell to operate correctly. The following
paragraphs discuss the SECDED scheme proposed and also discuss the
scrubbing of errors out of DBM.

6.1.4.1  SECDED

For soft failures, the previous studies (5) show that the
improvement factor due to error correction is essentially infinite; that
is, the system would be unable to produce useful results without
error correction at the presumed soft error rates. For hard
failures, the improvement factor due to the use of error
correction depends on the failure modes. Some failures, such as an
address decoding failure external to the memory chip that causes
multiple bit errors, are not helped by error correction. A
failure internal to a single memory chip is helped by error
correction. In addition, the error correction circuits have failures
that would not occur if there were no error correction. The
analysis following in Section 6.3.3 recognizes the other effects
contributing to undetected errors. That section uses a very
conservative improvement factor of 5 in the number of observed
errors when using SECDED for correcting hard failures vs the
situation where no SECDED is used. The following discussion
addresses these statements in more detail.

First  consider  the  case  of  soft  failures  as  represented  by  read 
failures.  About  5 x 10^2  operands  are  used  or  produced  during  the 
course  of  the  "typical"  15  minute  run  (5).  If  half  of  these  come 
from  processor  memory,  that  means  almost  2 x 10^^  bits  are  fetched 
from  processor  memory  during  the  course  of  a typical  run. 
Although accurate projections of bit error rates for large
semiconductor memory chips await more experience, it is plausible that
bit error rates may lie between one bit in 10^^ and one in 10^^
bits read. Under the above conditions, without error correction,
it  is  unlikely  that  the  typical  run  can  even  complete  successfully 
due  to  soft  errors. 


For hard failures, however, the picture is different. If one
memory chip output is stuck in one processor, only 1/521 of the
words  accessed  are  affected  by  that  failure.  The  processor  memory 
delivers  ^ x 10^^  bits  during  the  course  of  the  run.  If  one  bit 
in  every  word  in  that  one  processor  is  bad,  and  if  the  soft  error 
rate  is  1 in  10^2  or  better,  the  run  will  probably  complete 
successfully.  A failure  at  a specific  bit  in  one  chip  is  even 
less  likely  to  cause  trouble. 

Since  double  errors  or  worse  are  not  corrected  automatically  with 
the  proposed  SECDED  code,  it  is  important  to  use  preventative 
measures.  When  the  SECDED  logic  corrects  a failure,  a log  will  be 
updated  indicating  the  word  and  bit  position  in  the  memory  which 
was  corrected.  These  logs  will  be  examined  regularly  in  order  to 
detect  and  replace  failed  parts  before  they  cause  an  abort. 

The error correcting code of Table 6.1 appears to be the best
choice for the FMP. First, it is directly implementable by the
Motorola SECDED parity generator chip (each 8-bit wide slice of
the code exhibits exactly the same pattern of parity checks as
found in that chip). Second, it is much better than a randomly
selected SECDED at detecting triple errors. Even the optimum
SECDED is not very good at detecting triple errors when there are
55 bits used out of the underlying 64-bit long Hamming plus-parity
code block. This proposed code is almost as good as that optimum.

Each "x" in Table 6.1 means that that bit is included in the
parity check represented by its corresponding check bit. The seven
check bits are the 6 bits of the Hamming code, plus a bit that
allows an overall parity check. For improved performance against
multiple errors, the 7th bit contains an "x" only for those bit
positions which enter into 0, 2, or 4 of the other check bits.
Actual overall parity is the parity of all seven check bits. Odd
parity is used.
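
The general idea of Hamming check bits plus an overall parity bit
can be illustrated on a much smaller word. The sketch below is a
textbook (8,4) extended Hamming code using even parity; it is an
assumption-laden teaching example and is not the 55-bit code of
Table 6.1, which uses odd parity and a modified 7th check bit. All
names in it are hypothetical.

```python
def encode(data4):
    """Encode 4 data bits into 8 bits.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming positions, with check bits at positions 1, 2 and 4.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data4
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # overall (even) parity
    return code


def decode(received):
    """Return (status, data4); status is 'ok', 'corrected' or 'double'."""
    code = list(received)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position            # XOR of the set positions
    overall_bad = sum(code) % 2 == 1
    if syndrome == 0 and not overall_bad:
        status = "ok"
    elif overall_bad:
        # Odd number of flipped bits: treated as a single error at
        # position `syndrome` (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with good overall parity: double error.
        status = "double"
    return status, [code[3], code[5], code[6], code[7]]
```

Flipping one bit of an encoded word yields "corrected" with the
original data; flipping two yields "double". Three flipped bits
look like a single error to this decoder, which is the weakness of
any SECDED code discussed in the surrounding text.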

The bit number in Table 6.1 is not the bit number of the data
word. For one thing, the check bits are interspersed. The
correspondence of bit number as seen by the programmer to the bit number
of Table 6.1 is arbitrary. This mapping will be left as a logic
designer's option.

Triple  errors  appear  to  be  single  errors  to  the  proposed  code. 
Some  triple  errors  will  be  detected  when  the  SECDED  circuits 
detect  a failure  and  attempt  to  correct  a bit  outside  the  55-bit 
field  shown  in  the  table  (this  is  possible  since  the  code  chosen 
is  a portion  of  an  underlying  64-bit  long  Hamming  plus-parity  code 
block).  The  code  shown  in  Table  6.1  detects  14.6%  of  all  triple 
errors,  whereas  a randomly  selected  SECDED  would  be  expected  to 
detect 14.1% (nine bit locations out of 64 are outside the 55-bit
word).

TABLE 6.1

Error Correcting Code

[Table body not recoverable from the source scan. Columns: bit
numbers* 0 through 54, with the seven check-bit positions marked.
Rows: parity patterns of the 1st through 6th check bits and of the
overall parity bit, with an "x" marking each bit included in that
check.]

* The  assignment  of  bit  number  (corresponding  to  Hamming's)  may  be 
different  than  the  assignment  to  be  found  in  the  register  to  which 
this  parity  check  is  attached.  The  bit  number  found  here  is  the 
one  generated  as  an  indication  of  the  bit  to  be  corrected  in  the 
error correcting code.

Codes which are useful at detecting triple errors are also of
interest. One additional check bit allows a code in which triple
errors are almost always detected (better than 90% of the time).
The price for this improved error detection capability is a
connection network (CN) one bit wider or extended memory (EM)
access time 20 ns longer, more complex parity checking, more
complex decoding of the bit in error, and 2% more memory. Current
estimates of memory chip bit error rates imply that this
additional complexity is not warranted.

SECDED  checking  and  generating  logic  is  found  in  the  following 
locations: 

- Processors,  where  the  processor  generates  check  bits  for 
all  memories  it  accesses  (both  PM  and  EM  via  the  CN  buffer), 
and  checks  words  fetched  from  the  PM  or  received  via  the  CN 
buffer. 

- Coordinator,  where  the  function  is  parallel  to  that  in  the 
processors. 

- DBM,  in  the  channels  to  and  from  the  file  system. 

SECDED logic is not needed in the EM modules, since all EM data
will have check bits when stored, and will have their codes
checked at some point after being fetched from EM, usually upon being
read from a CN buffer in a processor.

In  addition  to  the  SECDED  on  all  memory  data,  there  are  some 
simple  parity  checks.  The  address-plus-instruction-code  sent 
through  the  CN  for  controlling  EM  buffers  has  parity  checked  at 
EM.  The  contents  of  microprogram  memory  in  the  processor  have  a 
parity  bit. 

The  responses  to  SECDED  and  parity  errors  are  as  follows: 

1.  EM module detects parity error on module-number/address/
op-code field sent from processor. The EM module does not return
an Acknowledge on bad parity, so the processor will continue to
send the same request. If the error was a transient, proper
operation will resume. If the error was a hard error, the
processor will hang on trying this request, eventually causing the
coordinator to have a time-out interrupt. The EM module sends an
"address parity bad" interrupt to the coordinator. This would
normally be masked off to allow useful processing to continue in
those cases where the retry works.

2.  Processor corrects single error. An interrupt to
processor-resident software results in the logging of the action in a
table in processor memory.

3.  Processor detects double error in word received from EM. The
processor halts with interrupt, and the program is discontinued.
Software can restart the program from some prior point, possibly
after system reconfiguration.

4.  In  all  of  the  above,  the  requestor  may  have  been  the  CN  buffer 
of  the  coordinator  used  by  the  coordinator  for  accessing  EM.  In 
these  cases,  read  "coordinator"  where  the  previous  two  sections 
say  "processor". 

6.1.4.2  Scrubbing Errors out of CCD Memory and Dynamic RAM

In the case of CCD memories, errors are not confined to the
reading and writing process. Errors can also arise within the
memory chips. If data is stored in a particular location with no
reference for a long time, such as hours or days, the probability
of errors may become intolerably high. It will be necessary,
therefore, to continually scan through the data base memory (DBM)
correcting all the single-bit errors in order to allow the
survival of the data base for a long enough period of time.

Depending on the magnitude of the soft-error problem, it may be
feasible to use a stronger error-correction code, and thus
eliminate the scrubbing. With scrubbing, the probability of
non-correctable errors grows linearly with time, the envelope of
pieces that individually have the form t^e, where e is the number of
errors in the uncorrectable case (Figure 6.1). With stronger

error  correction,  correcting  f errors,  the  curve  has  the  form  t^. 
e=2 for Hamming plus parity; f can equal any number for a BCH
code (7). Clearly, the "scrubbing" storage design has more
latitude against variations in error rate.
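
The linear-versus-power-law behaviour described above can be checked
numerically. The sketch below assumes independent bit errors
arriving at a constant rate, a 55-bit word, and an uncorrectable
event whenever e or more bits of one word are wrong between
corrections. The rate, times, and function names are hypothetical
values chosen for illustration, not figures from this study.

```python
from math import comb, expm1, log1p


def p_at_least(n_bits, p_bit, e):
    """Probability that at least e of n_bits independent bits are bad."""
    return sum(comb(n_bits, k) * p_bit**k * (1.0 - p_bit)**(n_bits - k)
               for k in range(e, n_bits + 1))


def p_fail_unscrubbed(n_bits, rate, hours, e=2):
    """Errors accumulate for the whole storage time: grows like t**e."""
    return p_at_least(n_bits, rate * hours, e)


def p_fail_scrubbed(n_bits, rate, hours, scrub_hours, e=2):
    """Errors accumulate only between scrubs: grows linearly in t."""
    per_interval = p_at_least(n_bits, rate * scrub_hours, e)
    intervals = hours / scrub_hours
    # 1 - (1 - p)**intervals, written to avoid cancellation for tiny p.
    return -expm1(intervals * log1p(-per_interval))


RATE = 1e-6      # assumed soft errors per bit per hour (illustrative)
WORD = 55        # bits per word, data plus check bits

day_scrubbed = p_fail_scrubbed(WORD, RATE, 24.0, 0.1)
day_unscrubbed = p_fail_unscrubbed(WORD, RATE, 24.0)
two_days_scrubbed = p_fail_scrubbed(WORD, RATE, 48.0, 0.1)
two_days_unscrubbed = p_fail_unscrubbed(WORD, RATE, 48.0)
```

With these assumed numbers, doubling the storage time roughly
doubles the scrubbed failure probability but roughly quadruples the
unscrubbed one, and scrubbing improves the one-day figure by about
the ratio of storage time to scrub interval.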

The  critical  aspect  of  DBM  is  the  storage  of  restart  files,  up  to 
10^  bits,  for  times  that  presumably  could  be  days.  The  method  of 
error  correction  used  will  depend  on  the  technology  to  be  used  for 
the  file. 

To  determine  the  optimum  rate  for  scrubbing  errors  out  of  CCD 
memory,  we  should  know  both  the  error  rate  for  spontaneously 
occurring  errors,  and  the  error  rate  for  the  reading  and  writing 
process.  For  any  given  error  correcting  code,  there  will  be  an 
optimum  scrubbing  frequency  where  the  two  sources  of  error  are  in 
balance  and  are  a minimum. 

In Reference 5, the assumption was made that the CCD memory of DBM
would lose, on the average, 1 bit per 3 x 10^® bits shifted. This
error rate was based on preliminary experience reported by
Fairchild. Since then, the cause of loss of bits has been identified
as background radiation, primarily due to alpha particles coming
from contaminants in the package. More recent quantitative data
is not available. New manufacturing techniques by the vendors
appear to be solving the problems. On the basis of the original
soft-failure rate data, a scrubbing rate of once every seven
minutes will be enough to keep a 10^ bit file error-free for one day
with probability 0.999.

Figure 6.1 Scrubbing versus Read-Time Error Correction

Scrubbing in the DBM will make use of hardware and data paths
which would exist even if scrubbing were not necessary. In
particular, the channels to and from the file system have buffers and
SECDED checkers and generators associated with them. Part of the
normal channel/interconnection path capabilities would be a
loop-back mode for diagnostics. All of these capabilities can be
utilized to implement scrubbing as needed. The DBM controller
will schedule blocks (probably 16K words) to the channel buffers
through the SECDED checker/generator and back to the CCD store.

The maximum transfer rate between the DBM and the file system is
expected to be 40 Mbits/sec. At this rate, the entire DBM can be
read in 3.5 minutes. Periods of high channel activity imply
lowered requirements on scrubbing due to natural activity within
the DBM. It is, therefore, reasonable to plan to use some of the
channel capabilities (buffers, SECDED, loop-back) to implement the
scrubbing functions. If DBM blocks are 16K words (a likely result
of CCD organizations), and if the scrub cycle needs to be seven
minutes, then the scrub rate is one block every 51.4 msec.
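
The scrub-rate arithmetic can be reproduced from the figures quoted
above. The sketch works backward from the 51.4 msec figure to the
number of blocks it implies; the 8192-block case is an assumed round
number used only for comparison, and the function names are
hypothetical.

```python
WORDS_PER_BLOCK = 16 * 1024     # "16K words" per DBM block
SCRUB_CYCLE_S = 7 * 60          # seven-minute scrub cycle, in seconds


def seconds_per_block(cycle_seconds, n_blocks):
    """Time available to scrub one block if every block is visited once
    per cycle."""
    return cycle_seconds / n_blocks


def blocks_implied(cycle_seconds, per_block_seconds):
    """Number of blocks consistent with a given per-block scrub period."""
    return cycle_seconds / per_block_seconds


# Working backward from the quoted 51.4 msec per block:
n_blocks = blocks_implied(SCRUB_CYCLE_S, 0.0514)      # about 8171 blocks
dbm_words = n_blocks * WORDS_PER_BLOCK                # about 1.34e8 words

# Assumed round figure of 8192 blocks, for comparison only:
period_8192 = seconds_per_block(SCRUB_CYCLE_S, 8192)  # about 0.0513 s
```

The quoted scrub period therefore corresponds to a DBM of roughly
134 million words organized as about eight thousand 16K-word blocks.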

As the geometries of the individual cells of integrated circuits
shrink, other parts are expected to evidence soft-error problems
similar to those being experienced by CCD parts now. 256 Kbit
dynamic RAMs (which may be considered as a technological
alternative to the 256 Kbit CCDs, depending on the design and
implementation schedule) are expected to experience a soft-error rate
large enough to also require scrubbing. The parts currently
planned for the extended memory (EM) have large enough geometries
that the soft-error rate is very low. In addition, the EM does
not contain any long-term data. Hence, no scrubbing is necessary
or planned for the EM.

6.1.5  Error  Detection  and  Correction  in  the  Connection  Network 

The Connection Network (CN) is of central importance in the
implementation of the proposed FMP. Since the design of
interconnection systems is generally not as well understood as that of
processors, and since there appears to be less redundancy, the planned
defenses against erroneous operation are described in some detail
below.

6.1.5.1  Magnitude of the problem

As described later, single transient errors are self-correcting in
the use of the CN. The only faults that might cause problems are
hard failures (i.e., permanent failures). The discussion below
shows that hard failures can always be detected during execution
of a user program (and are therefore by implication detectable
during confidence tests). Section 6.1.10.3, which follows, shows that
these faults are diagnosable once the job in process has been
aborted.

As to the magnitude of the problem, the CN is built from 39,280
identical LSI circuits. If these circuits have the "normal"
failure rate of 0.1 failures per million hours, the expected MTBF
will be 254 hours, or 33 failures per year. During the entire 10
year design life of the FMP, 330 failures are expected. With the
fault detection and isolation techniques outlined below, it is
very unlikely that one of these expected 330 failures will be
undetected.
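
The MTBF figure follows directly from the part count and the
per-part failure rate. The sketch below repeats that arithmetic; it
assumes failures of individual circuits are independent and occur at
a constant rate, and the 8760-hour year is an assumption (the text
does not state how many operating hours it counts per year).

```python
def mtbf_hours(n_parts, failures_per_million_hours):
    """MTBF of a group of parts in which any single failure counts,
    assuming independent, constant failure rates."""
    return 1.0e6 / (n_parts * failures_per_million_hours)


CN_CIRCUITS = 39_280            # identical LSI circuits in the CN
RATE_PER_MILLION_HOURS = 0.1    # "normal" failure rate from the text
HOURS_PER_YEAR = 24 * 365       # assumed: continuous operation

cn_mtbf = mtbf_hours(CN_CIRCUITS, RATE_PER_MILLION_HOURS)   # about 254.6
failures_per_year = HOURS_PER_YEAR / cn_mtbf                # about 34.4
```

A full 8760-hour year gives slightly more than the 33 failures per
year quoted in the text; the quoted figure corresponds to about
8400 powered hours per year.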

6.1.5.2  Defense vs. Type of Fault

6.1.5.2.1.  Single transient error in the request sent to EM. A
single transient failure in the module number, address, or opcode
field causes the EM module to detect a parity error, which causes
the processor to retry the operation in question. System software
normally allows retries to proceed unmolested.

6.1.5.2.2.  Single transient error in data. This is corrected by
the error correcting code and logged. Computation proceeds.

6.1.5.2.3.  Hard failure on the path from one processor to EM.
This hard failure will either cause parity errors to be detected
by the EM or SECDED errors in any words stored. The analysis in
section 6.1.5.3 shows that over half of the addresses sent through
the fault are detected as errors. The result is that such an
error will be detected very quickly.

6.1.5.2.4.  Hard failure on the data path from EM to processor.
Only data, with SECDED, flows over this path. The analysis of the
next section shows that over two-thirds of the faulty data words
are detected. Thus, such a failure is quickly detected, usually
on the first word transmitted after the failure occurs.

6.1.5.2.5.  Hard failure in the path-selecting control logic.
Here,  there  are  several  cases  to  consider. 

First,  if  the  wrong  path  is  selected,  and  if  the  wrongly  selected 
EM  module  has  a different  number  of  bits  in  its  CN  port  number,  a 
parity  error  is  detected  at  the  EM  module.  Half  the  EM  modules 
will  detect  such  a parity  error,  so  that  EM  accessing  will  not  go 
on for long without the error being detected.

Second, the correct path is selected, but with wrong priority, so
that a particular processor is being discriminated against. The
program will continue to execute correctly, but execution time
will be lengthened for certain patterns of access conflicts in the
CN. We believe that analysis will show that such priority
failures are harmless for some programs, including aero flow
codes, but no simulations to verify this expectation have yet been
done. Such failures can be found by diagnostics. All processors
are sent to fetch from EM, execution is allowed to proceed for a
fixed time, and then it is observed that the processors with
correct results are not the expected set.

Third, the strobe line is falsely high. This will cause the CN
buffer to think that the EM module has granted access when in fact
it has not. If there are no CN delays, the correct word will come
back in spite of the fault. When there are delays, the CN buffer
will pull in "garbage" since no real word is coming back at the
time the false acknowledge says it is. Since the path from this
CN buffer, if blocked, is blocked for at least one CN clock, that
garbage is either all zeroes or all ones, for which the Hamming
error correction identifies bit 63 and bit 56 respectively as the
bit in error. Since there is no such bit, the error is immediately
caught.

6.1.5.3 Analysis

As  described  previously  in  Chapter  5,  the  Connection  Network  is 
designed  to  transmit  a sequence  of  11-bit  frames.  The  main 
purpose  of  this  approach  is  to  reduce  the  number  of  wires  and  the 
complexity of the network itself. If the entire message is 33
bits long (three frames), then a stuck-at fault on one line will
change either 0, 1, 2, or 3 bits, depending on whether those bits
had the same value as the bit produced by the stuck-at fault or
not. A stuck-at-ONE fault produces no errors when all the bits were
ONE to start with. When the entire message is 55 bits long (five
frames), the stuck-at fault jams five successive bits to the state
at which the fault is stuck, producing 0, 1, 2, 3, 4, or 5 errors.

First consider the case that the module-number/address/opcode is
being passed to the EM (33 bits) and the bit of the EM module
number is the same as the value at which the fault is stuck. The
remaining two bits can have either 0, 1, or 2 errors. When the
remaining bits are address bits, it appears valid to assume that
they behave as random bits. Hence we have 25% of the time no
error, 50% of the time a single error that is detected by parity
failure, and 25% of the time a double error that is not detected.
Exactly two thirds of the errors are detected. After ten addresses
have been passed through this fault, the probability of the
error being detected is 99.9983%; after twenty, 99.99999997%.

2. Take the case as above, except the EM module number bit is
wrong. When parity is checked, at the wrong EM module this time,
there will be either 1, 2, or 3 errors in the 33-bit package,
again with probability 25%, 50% and 25%. The single and triple
errors result in parity errors and are detected. Thus, exactly
one half of the errors are detected. After ten addresses have
passed through this fault, the probability of the error being
detected is 99.9%; after twenty, 99.9999%, again on the random
assumption for addresses.
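
The two address cases above can be checked by direct enumeration. The
following Python sketch is illustrative only: the function names are
hypothetical, and it assumes, as the text does, that the two remaining
bits on the faulty line are random and that parity catches exactly the
odd error counts. The cumulative figure is computed over requests that
actually arrive in error, which is the convention that reproduces the
percentages quoted in the text.

```python
from fractions import Fraction
from itertools import product


def detected_fraction(module_bit_wrong: bool) -> Fraction:
    """Fraction of erroneous 33-bit requests that a parity check catches.

    The stuck-at line carries one bit in each of the three 11-bit frames.
    One of those is the EM module number bit; the other two are treated
    as random address bits (the assumption stated in the text).
    """
    detected = undetected = 0
    for flips in product((0, 1), repeat=2):    # which of the other two bits flip
        errors = sum(flips) + (1 if module_bit_wrong else 0)
        if errors == 0:
            continue                            # request arrives intact
        if errors % 2 == 1:
            detected += 1                       # odd error count: parity fails
        else:
            undetected += 1                     # even error count: slips through
    return Fraction(detected, detected + undetected)


def caught_within(n: int, fraction: Fraction) -> float:
    """Probability that at least one of n erroneous requests is detected."""
    return float(1 - (1 - fraction) ** n)


case_1 = detected_fraction(module_bit_wrong=False)
case_2 = detected_fraction(module_bit_wrong=True)
for label, frac in (("case 1", case_1), ("case 2", case_2)):
    print(label, frac, caught_within(10, frac), caught_within(20, frac))
```

Exact fractions are used for the per-request result so that "two
thirds" and "one half" come out exactly rather than as rounded floats.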

3. For the third case, data, the analysis is of the same kind,
but there are more cases. Hence, it is easier to present the
analysis in the form of a table for the cases that there are 0, 1,
2, 3, 4, or 5 errors. For each possible number of errors, Table
6.2 shows the percentage of time we expect to find that number of
errors (the binomial distribution), the percentage of time
that this hardware error causes no operational failure (for
example, a single error is corrected using the SECDED code), the
percentage of time that this number of errors is detected, and the
percentage of time that a single data word can slip by in error.
14.6% of the triple errors will be detected, and 85.4% of them
will appear to be correctable single errors and therefore not
detected.


Table 6.2

Single 55-bit Data Word passing through CN with single hard fault

No. bit                  No Func.     Error        Error
errors     Occurrence    Failure      Detected     Undetected

0            3.12%         3.12%        -            -
1           15.62%        15.62%        -            -
2           31.25%         -           31.25%        -
3           31.25%         -            4.56%       26.69%
4           15.62%         -           15.62%        -
5            3.12%         -            0.46%        2.67%

TOTAL                     18.75%       51.89%       29.36%

From the table, we see that the ratio of detected failures to
undetected failures is 51.89/29.36. That is, 64% of all the
functional failures are detected. After ten words have passed through
this hardware fault, the probability that the fault has been
detected is 99.9999%, assuming random data.
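
The entries of Table 6.2 can be regenerated from the binomial
distribution. The Python sketch below is a cross-check under stated
assumptions, not the authors' procedure: each of the five bits on the
faulty line is taken to be wrong with probability one half, even error
counts are taken as always detected, and the 14.6% detection share that
the text quotes for triple errors is also applied to five-bit errors
(an inference from the table's last row).

```python
from math import comb

# Share of odd multi-bit errors (3 or 5 bits) that SECDED flags instead of
# "correcting" as a bogus single error. 14.6% is the text's figure for
# triple errors; using it for 5-bit errors is inferred from Table 6.2.
ODD_MULTI_DETECTED = 0.146


def table_6_2(frames: int = 5):
    """Return rows of (errors, occurrence, no_failure, detected, undetected)."""
    rows = []
    for k in range(frames + 1):
        occurrence = comb(frames, k) / 2 ** frames    # binomial, p = 1/2
        if k <= 1:                                    # none, or corrected by SECDED
            row = (k, occurrence, occurrence, 0.0, 0.0)
        elif k % 2 == 0:                              # 2 or 4 errors: detected
            row = (k, occurrence, 0.0, occurrence, 0.0)
        else:                                         # 3 or 5 errors
            caught = occurrence * ODD_MULTI_DETECTED
            row = (k, occurrence, 0.0, caught, occurrence - caught)
        rows.append(row)
    return rows


rows = table_6_2()
no_failure = sum(r[2] for r in rows)
detected = sum(r[3] for r in rows)
undetected = sum(r[4] for r in rows)
detected_share = detected / (detected + undetected)

for k, occ, ok, det, undet in rows:
    print(f"{k}  {occ:7.2%}  {ok:7.2%}  {det:7.2%}  {undet:7.2%}")
print(f"detected share of functional failures: {detected_share:.1%}")
```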

6.1.5.6 Logical Checks

Miscellaneous  logical  checks  can  be  considered.  The  design  intent 
of  these  checks  is  to  localize  the  effects  of  some  error  in  the 
FMP.  The  following  list  of  checks  includes  those  also  listed  in 
Appendix  C in  the  list  of  interrupts,  plus  others. 

- Parity  checks  on  microprogram 

- Memory bounds checks (optionally inserted by compiler)

- "Uninitialized"  word  fetched  to  instruction  decoder  or 
floating  point  unit 

- Illegal  opcode 

- Detection  of  unnormalized  floating  point  operand  (except 
second  half  of  double  length  floating  point) 

- Integer  overflow  or  underflow 

- Divide  by  integer  zero 

- Floating point overflow (either tested for or marked
"unrepresentable", a compile time option)

- Timeout 

In  addition,  there  is  a bit  in  the  interrupt  register  reserved  for 
any  miscellaneous  logic  malfunction  checks  that  will  be  built  into 
the  hardware.  Lock-up  of  the  end-around  carry  chain  is  an  example 
of  the  sort  of  logic  error  whose  occurrence  would  be  reported  in 
this  bit. 

6.1.7 Restart

Previous analysis (5) shows that the optimum time between restart
dumps is given by

Topt = (2 Tg Tr)^(1/2)

where Tg is the mean time between failures (intermittent or hard)
that cause an abort, and where Tr is the time spent taking one
restart dump and also the time required to load the restart point
and switch to user programming, assumed equal.

Typical runs are 10 minutes, and typical data bases are 15 x 10^6
words (5). Since 15 x 10^6 words are loaded into EM in 0.375 sec,
we have Tr not more than about 0.5 sec. Since Tg will be on the
order of 10 hours or longer, we have Topt ≈ (2 x 0.5 x 36000)^(1/2)
seconds, or just over 3 minutes. However, the amount of computation
time saved by dividing a typical 10 minute run into three
restartable segments is estimated to be about 0.3% of all FMP
time¹. Hence there is little point to providing restart points
within the typical 10-minute run.
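
The arithmetic behind the optimum interval can be reproduced in a few
lines. This Python sketch is illustrative: the function names are
hypothetical, and the loss model in `lost_fraction` is an assumed
first-order model (dump overhead plus expected wasted computation), not
a formula taken from the source, which cites reference (5) for the
derivation. It is the simplest model whose minimum falls at the
square-root formula.

```python
from math import sqrt


def optimum_dump_interval(t_fail: float, t_dump: float) -> float:
    """Topt = (2 * Tg * Tr)^(1/2), with all times in seconds."""
    return sqrt(2.0 * t_fail * t_dump)


def lost_fraction(interval: float, t_fail: float, t_dump: float) -> float:
    """Assumed first-order model of time lost per unit of run time.

    t_dump / interval        : overhead of taking the dumps
    interval / (2 * t_fail)  : computation expected to be discarded by aborts
    """
    return t_dump / interval + interval / (2.0 * t_fail)


T_FAIL = 10 * 3600.0    # Tg: 10 hours between aborting failures
T_DUMP = 0.5            # Tr: one restart dump (or reload), in seconds

t_opt = optimum_dump_interval(T_FAIL, T_DUMP)
print(f"Topt = {t_opt:.1f} s ({t_opt / 60:.2f} min)")
print(f"loss with dumps every Topt: {lost_fraction(t_opt, T_FAIL, T_DUMP):.2%}")
print(f"loss with no dump in 600 s: {lost_fraction(600.0, T_FAIL, T_DUMP):.2%}")
```

Under this model the two loss terms are equal at the optimum, and the
difference between dumping at Topt and not dumping at all during a
10 minute run is a few tenths of a percent.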

Unless  restart  points  are  provided  by  the  user,  the  restart 
strategy  will  be  to  restart  the  same  task  again  automatically 
under  software  control.  Automatic  restart  is  limited  to  those 
aborts  that  are  probably  caused  by  hardware  error,  such  as  parity 
errors. Aborts that are likely to be software errors, such as
addressing errors, will not trigger automatic restarts. The
operating system handles automatic restarts, and reports their
occurrence.

The two types of restart dumps mentioned above should be delineated.
Automatic restart dumps are likely to be a roll-out (with a
later roll-in) of the entire job. In this case, all data space,
variables, flags, etc. would be dumped to the file system via
DBM. Restart points provided by the user are expected to be more
restricted. The user would be permitted to specify selected data
areas to be affected and to specify when such snapshots are to be
taken. These user selected restart dumps would be much more
efficient and cause considerably less load on the system than the
automatically generated dumps. In addition, the user will be
permitted to insert an alternate entry point in his main program (a
restart point), where appropriate arrays from the restart dumps
would be reloaded. In the initially delivered system, automatic
checkpoint restart transparent to the user will not be included.
Such a facility would be included at a later time.


¹ Topt is about 200 seconds; Tr is 0.5 seconds. At optimum, the
fraction of time lost due to restart dumps is approximately equal
to the time lost due to wasted computation, that is, computation
that ends in an abort. In a 10 minute run there are two 0.5 sec
restart dumps, plus the initial loading of the data base (total
1.5 sec) and an equal expected amount of time lost by aborting
good computation. (1.5 sec + 1.5 sec)/600 sec reduces net
throughput to 99.5% of what it would have been if defense against
aborts were not necessary. If no restart dump is taken during the
10 minute run, the 1.5 sec of data moving is reduced to 0.5 sec.
However, the time lost from wasted computation will triple, since
600 seconds is triple 200 seconds. Hence, the percentage of time
lost is (0.5 sec + 4.5 sec)/600 sec, and net throughput is 99.2%
instead of 99.5%. This is a small price to pay for the convenience
of not having to worry about restarts.


Since the intent of on-line spare components is to provide the
capability of maintaining the desired level of performance through
automatic reconfiguration, the system software would be able, in
most cases, to restart automatically a job whose execution may
have been interrupted by a hard failure. When the job is a
program with user-specified restart dumps and restart points, the
most recent restart dump would automatically be chosen and the
execution would resume at the restart point. For example, in a
one-hour run involving 500 time steps, it might be reasonable to
specify a restart dump every 25th time step. The restart entry
point would include the reinitialization of control variables to
states saved as part of the restart dump. The above technique is
particularly appropriate to the aero flow codes, where the
computation converges.
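
The user-specified restart scheme in the example above (a dump every
25th time step, with control variables restored at the restart entry
point) can be sketched as follows. This is a minimal Python
illustration only: the `advance` function is a hypothetical stand-in
for one time step of a real flow code, and the single injected abort
stands in for a hard failure.

```python
import copy

DUMP_EVERY = 25      # time steps between restart dumps (the text's example)
TOTAL_STEPS = 500


def advance(state: dict) -> None:
    """Stand-in for one time step of a real flow code (hypothetical)."""
    state["step"] += 1
    state["residual"] *= 0.99


def run(fail_at=None) -> dict:
    """Run to completion, restarting from the latest dump after one abort."""
    state = {"step": 0, "residual": 1.0}
    dump = copy.deepcopy(state)             # initial load acts as the first dump
    failed = False
    while state["step"] < TOTAL_STEPS:
        if fail_at is not None and not failed and state["step"] == fail_at:
            failed = True
            state = copy.deepcopy(dump)     # restart entry point: reload saved state
            continue
        advance(state)
        if state["step"] % DUMP_EVERY == 0:
            dump = copy.deepcopy(state)     # user-specified restart dump
    return state


clean = run()
recovered = run(fail_at=130)                # abort partway between two dumps
print(clean == recovered)
```

Because the control variable `step` is saved with the data, the
restarted run repeats only the steps taken since the last dump.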

An analysis of the effect of restart on the operation of the FMP,
using the reliability model, is contained in Section 6.2. In that
section, the assumed "restart" time of 6 minutes corresponds not
to the Tr above, but to the total time spent at the time of
restart, including system software response to the abort, logging
of the error, reconfiguration of the system, if any, and running of
confidence tests (and possibly diagnostics as well, depending on the
type of error detected). Six minutes seems extremely generous.

6.1.8 Error Logging

Where  possible,  all  errors  are  logged.  The  mechanism  for  logging 
errors  is  via  interrupt.  Both  the  processor  and  the  coordinator 
have  three  classes  of  interrupts  which  can  be  used  for  logging. 

One class of interrupts reports non-fatal errors (such as
single-error correction of a transient parity error detected in EM).

A second class consists of the programmatic interrupts
(CALLI instruction), which can be used for calling on system
software to log errors. In many cases these may be errors
detected by tests inserted by the compiler into the code stream.

The third class of interrupts is used to log all fatal errors.
Since fatal errors involve some non-correctable situation, these
interrupts are usually directed to the coordinator; in the case
of the coordinator itself, they are directed to the diagnostic
controller and the support processor.

The design intent is to record the memory address and bit number
of bit in error (also called "syndrome") for all SECDED error
corrections and detections. It is likely that programmatic
interrupts will report not only the observed error condition, but also
a code which would be used to obtain a link back to the original
source code. A table in each memory holds the record of the last
N errors corrected. The size of the error log tables and the
frequency with which they are collected and reported have yet to be
determined.
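
A table that holds the last N corrected errors behaves like a
fixed-size ring buffer. The Python sketch below shows one way such a
table could work; the class and field names are hypothetical, the
capacity of 4 is arbitrary (the text leaves N undetermined), and
clearing the table on collection is an assumption rather than a stated
design decision.

```python
from collections import deque
from typing import NamedTuple


class CorrectedError(NamedTuple):
    address: int      # memory address of the word that was corrected
    syndrome: int     # bit number of the bit in error


class ErrorLog:
    """Keeps only the last N corrected errors; older entries fall off."""

    def __init__(self, capacity: int) -> None:
        self._entries = deque(maxlen=capacity)

    def record(self, address: int, syndrome: int) -> None:
        self._entries.append(CorrectedError(address, syndrome))

    def collect(self) -> list:
        """Hand the current entries to the collector and clear the table."""
        entries = list(self._entries)
        self._entries.clear()
        return entries


log = ErrorLog(capacity=4)
for n in range(6):
    log.record(address=0x1000 + n, syndrome=n)
print(log.collect())
```

With a bounded table, the collection frequency determines how many
corrections can be lost between reports, which is why the two
parameters would need to be chosen together.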

6.1.9 Invariants


Applications software is one of the links in the chain that
maintains the trustworthiness of the NASF. Although the application
software is outside of the scope of work when developing a
facility, it is part of the system seen by the user and therefore must
be considered when discussing the trustworthiness of a system.
Inclusion of checks on quantities which should be invariant or in
some way well behaved during the course of the computation seems
appropriate. Examples might be:

- Total quantity of air within the mesh (as computed from the
appropriate function of geometry and pressure) should change
in accordance with air inflow and outflow at the boundary.

- Any global criterion for convergence should improve
monotonically for steady airflows.

- Changes in total energy in the system should correspond to
energy inflow and outflow at boundaries.


Discussions  are  currently  under  way  on  constructs  for  the  language 
which  would  make  such  invariant  checking  more  convenient. 
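
Checks of this kind are short to write in application code. The Python
sketch below illustrates the first two examples; the function names,
the tolerance, and the sample numbers are all hypothetical, and a real
flow code would compute the totals from its own mesh data.

```python
def mass_balance_ok(mass_before, mass_after, inflow, outflow, rel_tol=1e-6):
    """Change in total air in the mesh should match net flow at the boundary."""
    expected = mass_before + inflow - outflow
    scale = max(abs(expected), abs(mass_after), 1.0)
    return abs(mass_after - expected) <= rel_tol * scale


def converging_monotonically(residuals):
    """A global convergence measure should never get worse for steady flow."""
    return all(later <= earlier
               for earlier, later in zip(residuals, residuals[1:]))


# A healthy step, then one where a corrupted word inflates the total,
# then a residual history that turns upward.
print(mass_balance_ok(1000.0, 1002.5, inflow=4.0, outflow=1.5))
print(mass_balance_ok(1000.0, 1130.5, inflow=4.0, outflow=1.5))
print(converging_monotonically([1.0, 0.4, 0.1, 0.3]))
```

A relative tolerance is used because accumulated rounding in a large
sum makes an exact equality test too strict.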

6.1.10 Diagnostics

All of the FMP shall be diagnosable. Creating diagnostics is
difficult at best, because of its interdisciplinary nature.
Hardware features for aiding diagnostics must be designed. The
diagnostic programmer must be expert both in the logic design of the
machine being diagnosed, and expert in programming at the machine
dependent level. Completely automatic diagnostics for all
conditions is an unreasonable goal. This project would plan on
computer-assisted diagnostics.

The built-in fault detection mechanisms of the FMP have already
been discussed. In order to meet the desired goals for
availability and MTTR (Mean Time to Repair) for the system, direct and
simple means for diagnosis of the system components are required.
Because of the scope of the system, direct control of diagnostics
from some central point (the diagnostic controller (DC) for
instance) is not realistic. A hierarchy of controls will be
provided. In general, every diagnostic interface to the next level
of detail in the system is expected to have a mode of operation
which allows the outer level direct control over setting and
observing any state (bits) in the immediate next level of detail.
For example, the logic in the diagnostic controller (DC) would be
tested (set DC state including command, run the DC clock, observe
the DC state) by the support processor. The diagnostic
controller, in turn, tests the state control logic of the coordinator in
the same way. The coordinator tests its own memory and the state
control logic of each of the processors. The processors and
coordinator jointly check the Connection Network and the EM
modules. The processors will also be able to check each other.
The coordinator also tests the state control logic of the Data
Base Memory controller. The DBM controller then tests the rest of
the DBM including the path to the File System.
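
The hierarchy just described implies an order in which units come to
be trusted. The Python sketch below encodes the "tests" relation from
the paragraph as a small graph and walks it breadth-first. It is only
an illustration: the unit labels are informal, and the Connection
Network and EM modules, which the text says are checked jointly by the
processors and the coordinator, are filed under the processors for
simplicity.

```python
from collections import deque

# Who tests what, as read from the paragraph above. Names are informal labels.
TESTS = {
    "support processor": ["diagnostic controller"],
    "diagnostic controller": ["coordinator"],
    "coordinator": ["coordinator memory", "processors", "DBM controller"],
    "processors": ["connection network", "EM modules"],
    "DBM controller": ["DBM", "path to file system"],
}


def trust_order(root: str = "support processor") -> list:
    """Breadth-first walk: a unit is exercised only after its tester is trusted."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        unit = queue.popleft()
        order.append(unit)
        for target in TESTS.get(unit, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


for position, unit in enumerate(trust_order(), start=1):
    print(position, unit)
```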

Figure 6.2 shows the layered structure of the off-line
diagnostics. Layer 1 is the initial phase of the "hard core", when the
Support Processor is learning to trust the command-accepting
portion of the DC. Layer 2 is the rest of the "hard core", also
imposed by a Support Processor program, checking out the DC and
enough of the coordinator so that the coordinator can be trusted
to execute successfully. Layer 3 runs on the coordinator and
exercises that portion of the FMP to which the coordinator has
direct access. Layer 4 consists of those portions of the FMP to
which the coordinator has only indirect access. The coordinator
must cause the DBM controller and the processor to execute certain
operations in order to get these portions exercised. Some
diagnostics for layer 4 run on the DBM controller and the
processors as an array.

This on-line form of diagnostics is used as needed to isolate or
confirm an error to a replaceable unit (such as a processor). At
that point the system is reconfigured, checked and execution
resumes. If the automatic diagnostics are unable to confirm the
location of a fault, the same controls are accessible to the
maintenance personnel, who can develop custom test sequences as
required. Note that when the system successfully detects a
failure, isolates it to a system component, swaps in a spare
component and resumes execution without requiring manual intervention,
the system is defined to be continuously available. Only when
manual intervention is required to isolate a problem and restart
the system is the system considered to be unavailable.

Once a failure has been isolated to a system component, such as a
particular processor, and that component has been switched
"off-line", isolation of the bad component can proceed concurrently
with the resumption of execution of the FMP. These off-line tests
would consist of two types. Some tests will be possible with the
system component still attached to the system. These tests would
allow "in-situ" testing without disturbing the environment in
which the error occurred. Thus, spare system components will be
capable of access to other spare components without disruption of
the on-line portion of the FMP.

In  addition  to  the  above  test  modes,  test  equipment  is  expected  to 
be  available  to  test  the  removable  system  components  away  from  the 
system. 

Figure 6.2 FMP Block Diagram with Diagnostic Layers Superimposed


6.1.10.1  Level  of  Performance 


One should be aware that there is no single magic date on which
the diagnostics will be "finished". The delivery date for the
diagnostic software will merely mark a time at which the
diagnostics achieved have a useful level of accuracy. On that date,
there will be still room for improvement in the diagnostics.
Diagnostic programs should continue to improve as operating
experience shows up unanticipated failure modes and shows the
areas in which the accuracy of the diagnostics can be improved.
The goal is to achieve the highest possible uptime with the least
amount of time lost to either downtime or to running
diagnostics. It will be important to continue to fund diagnostic
development at a modest level of effort for the life of the NASF in order
to continually improve the efficiency of the support operations
and to reflect the design updates and changes which are a normal
part of the life of any system.

The initial capabilities of the automatic diagnostics system have
yet to be defined. The automatically executed diagnostics would
detect X% of all possible failures. The goal of this set of
automatic diagnostic programs is to isolate faults to the least
replaceable unit at the FMP level (coordinator or CN card,
processor, EM module, ...). The diagnostics shall locate the failure to
a single LRU Y% of the time, and shall locate the failure to
within N LRUs Z% of the time. When a failure could be either on
the backplane or on an LRU, the probability of detecting whether or
not it is on the LRU itself or in the backplane behind the LRU
will drop to W%.

The off-line LRU diagnostics (tester programs) shall localize
failures to the chip, or to some number of chips, with similar
percentages. U% of the failures shall be found, V% shall be
localized to within N chips, and T% shall be localized to within one
chip or component. All of the above percentages need to be
determined.


6.1.10.2  NASF  Computer-Assisted  Diagnostic  Tools 

A diverse  set  of  diagnostics  will  be  implemented  for  the  NASF. 

6.1.10.2.1. Support Processor System Diagnostics. The Maintenance
Diagnostic Unit (MDU), a separate execution unit of the Support
Processor system, can impose diagnostic operation on any off-line
elements of the Support Processor. The MDU can write information
into any flipflop of the Support Processor, cause the unit to
execute any number of clocks, and then read the state of any
flipflops. Results are then compared to precomputed results.
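
The write, clock, read, and compare cycle can be shown on a toy unit.
In the Python sketch below everything is hypothetical: the unit under
test is a made-up 4-bit counter, and the method names merely mirror the
three operations the text attributes to the MDU.

```python
class ToyUnit:
    """Hypothetical unit under test: a 4-bit counter held in four flipflops."""

    def __init__(self) -> None:
        self.flipflops = [0, 0, 0, 0]           # least significant bit first

    def write_state(self, bits) -> None:
        self.flipflops = list(bits)

    def clock(self, cycles: int = 1) -> None:
        for _ in range(cycles):
            value = sum(b << i for i, b in enumerate(self.flipflops))
            value = (value + 1) & 0xF
            self.flipflops = [(value >> i) & 1 for i in range(4)]

    def read_state(self):
        return list(self.flipflops)


def run_test(unit, initial, cycles, expected) -> bool:
    """Set state, run the clock, observe state, compare with the precomputed result."""
    unit.write_state(initial)
    unit.clock(cycles)
    return unit.read_state() == list(expected)


# Start at 14 and clock three times: a good counter wraps around to 1.
print(run_test(ToyUnit(), initial=[0, 1, 1, 1], cycles=3, expected=[1, 0, 0, 0]))
```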


6.1.10.2.2. Support Processor Peripheral Exercisers. Programs
resident  on  the  Support  Processor  exercise  the  peripherals  of  the 
Support  Processor. 

6.1.10.2.3. FMP Off-line Diagnostics. These diagnostics are used
when the FMP is not executing user programs and is considered
"off-line" in terms of production commitments. These diagnostic
procedures execute throughout the FMP depending on their purpose.

The "hard core" of these diagnostic procedures is a program
resident on the Support Processor exercising the FMP via the DC.
During early debugging, the DC will be available before the
diagnostics have been written, and some diagnostic capability will
exist by controlling the DC manually from the maintenance console.

After the coordinator has been diagnosed (or after confidence has
been gained in the coordinator), most of the rest of the FMP
diagnostics will run in the coordinator. These run much faster than DC
diagnostics do. The analysis portion of all those diagnostics
runs in the Support Processor. The coordinator will check the
viability of each of the EM module controls. The EM modules will
be exercised in detail as part of the CN test.

Each  processor  will  check  its  memory.  CN  diagnostics  require  the 
execution  of  EM  accesses  from  a number  of  processors  acting  in 
concert.  The  CN  diagnostics  therefore  occupy  the  entire  array, 
just  as  does  a user  program. 

6.1.10.2.4.  Off-line  LRU  Diagnostics.  These  diagnostic  programs 
execute  on  the  test  equipment.  Every  LRU  can  be  diagnosed  to  the 
chip level, or exercised with sufficient flexibility that the
technician  can  diagnose  to  the  chip  level.  The  number  of 
different  types  of  testers  which  may  be  required  is  yet  to  be 
determined.  All  testers  are  expected  to  be  program  compatible 
with  each  other,  so  that  one  language  creates  tests  for  all  of 
them.  That  test  generation  language  would  be  linked  to  the  design 
data  base. 


6.1.10.2.5 PAL (Programming Aid for Logicians). PAL is the
language  in  which  simple  tests  can  be  written  on-the-spot  for 
execution  by  the  DC,  or  for  execution  on  the  B7800  for  exerting 
control  over  the  array  via  the  DC.  Eventually,  the  PAL  programs 
would  form  a library  that  would  continue  to  be  useful  after 
delivery,  especially  for  the  small  residue  of  failure  modes  which 
the  automatic  diagnostics  do  not  adequately  support. 


6.1.10.2.6  Analysis  of  Logged  Errors.  Tables  which  contain  the 
error  logs  would  be  periodically  collected  and  provided  to  a 
program  which  analyzes  and  summarizes  the  error  activity  in  these 
logs.  This  program  would  execute  on  the  Support  Processor. 

6.1.10.3 CN Diagnostics

The  Connection  Network  represents  a design  which  is  novel  when 
compared  with  previously  existing  circuitry  for  which  diagnostics 
have  been  generated.  The  diagnostic  approach  described  below 
would  allow  the  FMP  to  isolate  faults  to  the  bit  and  node  within 
the  CN.  Since  the  approach  is  a successive-refinement  technique, 
some savings may be gained by stopping the FMP automatic
diagnostic at the board level (the replaceable unit) and isolating the
failed  chip  using  off-line  test  equipment. 

6.1.10.3.1  Assumptions  and  Design  Requirements.  The  following 
features  of  the  FMP  design,  and  of  the  CN  portion  of  it,  are  the 
basis  for  what  follows. 

- SECDED is checked on the data. The checking is performed
in processor or coordinator.

- Parity  is  checked  on  the  addresses  and  operation  codes  sent 
from  processor  or  coordinator  to  a single  EM  module. 

- The  Omega  network,  from  which  the  CN  design  is  derived,  has 
one  and  only  one  path  between  a port  on  one  side  and  a port 
on  another,  so  that  when  an  error  is  detected,  the  path 
through  the  network  taken  by  that  erroneous  data  is  known. 

- When processor number and EM module number are known, the
operating system can translate these numbers into CN port
number on the processor side and CN port number on the EM
side. In general, errors will be reported by processor
number and EM module number, whereas the diagnostics need to
know physical CN port numbers. This translation needs to be
done not only for CN diagnostics, but also for processor and
EM module diagnostics as well.

6.1.10.3.2 Localizing a Hard Error in the CN. The analysis of
the CN starts with the analysis of the Omega network. The
argument will then be expanded to the more complex case of the actual
CN. Between port n on one side and port m on the other side,
there is a fixed path. All traffic between these two ports takes
the same path. Between some other ports n' and m' there is also a
fixed path. None or some of the nodes on the path n'-m' are the
same as the nodes on the path n-m. Inspection of the four binary
numbers n, m, n', and m', bit by bit, will disclose in which of
the ten levels of logic in the CN these paths have common
nodes. By choosing two paths n-m and n'-m' which have some nodes
in common, and finding that the same error occurs in data
traversing both such paths, we localize the fault to those nodes.

Given a particular path from m to n, and knowing which contiguous
levels of the ten logic levels we want to include in some other
path, we make m' different from m for all those bits corresponding
to levels on the left side of the Omega for which the path is not
to be identical, and we make n' different from n for all those
bits corresponding to levels on the right side of the Omega for
which the path is not to be identical.
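
The bit-by-bit comparison described above can be sketched in Python
using textbook Omega routing (a perfect shuffle followed by a 2x2
switch at each level, steered by the destination bits, most significant
first). This is an assumed model, not the CN's actual wiring: the
function names are hypothetical, and treating one side as "source" and
the other as "destination" is a labeling choice made here.

```python
STAGES = 10                       # ten levels of logic, so 2**10 ports per side


def omega_path(src: int, dst: int, stages: int = STAGES) -> list:
    """Node (switch) index visited at each level on the unique src-to-dst path."""
    mask = (1 << stages) - 1
    line, nodes = src, []
    for level in range(stages):
        bit = (dst >> (stages - 1 - level)) & 1
        line = ((line << 1) & mask) | bit    # shuffle, then switch output = bit
        nodes.append(line >> 1)              # two adjacent lines share a switch
    assert line == dst                       # routing really ends at dst
    return nodes


def shared_levels(path_a: list, path_b: list) -> list:
    """Levels at which two paths pass through the same node."""
    return [lvl for lvl, (a, b) in enumerate(zip(path_a, path_b)) if a == b]


SRC, DST = 0b0000000101, 0b1100110011
suspect = omega_path(SRC, DST)
# Flip one source bit and one destination bit to leave a contiguous
# window of common nodes in the middle of the network.
probe = omega_path(SRC ^ (1 << 6), DST ^ (1 << 3))
print(shared_levels(suspect, probe))
```

In this model the node at a given level depends only on the low-order
source bits and the high-order destination bits, so changing a source
bit removes common nodes at the early levels and changing a
destination bit removes them at the late levels.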

Presumably the diagnostics will be written using a binary search
strategy. First we run tests in which the faulty path and the
other path have four nodes in common, then tests in which they
have two nodes in common, then one.
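
The narrowing strategy amounts to an ordinary binary search over
levels. The Python sketch below assumes a single hard fault on the
failing path; `error_seen_in_window` is a hypothetical hook standing
for "run a probe path that shares only these levels with the failing
path and report whether the same error appears", and the fault here is
simulated rather than measured.

```python
def localize_level(levels: int, error_seen_in_window) -> int:
    """Binary search for the faulty level, halving the window each probe.

    error_seen_in_window(lo, hi) reports whether a probe path sharing only
    levels lo..hi-1 with the failing path shows the same error.
    """
    lo, hi = 0, levels
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if error_seen_in_window(lo, mid):
            hi = mid             # fault lies in the lower half of the window
        else:
            lo = mid             # otherwise it must be in the upper half
    return lo


# Simulated hardware: the fault sits at level 6 of 10.
FAULTY_LEVEL = 6
probes = []


def simulated_probe(lo: int, hi: int) -> bool:
    probes.append((lo, hi))
    return lo <= FAULTY_LEVEL < hi


print(localize_level(10, simulated_probe), probes)
```

For ten levels this needs at most four probe windows, each of which
would in practice be exercised with several data words so that the
detection probabilities of section 6.1.5.3 apply.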

In the preferred CN version, there are two Omega networks, not
one, with the result that the path is unique only to within a pair
of nodes (one in the upper Omega, one in the lower Omega) at each
point. Two paths that must intersect in the simple Omega can pass
each other without using the same gates, if one uses the node in
the upper Omega network and the other uses the node in the lower
Omega network. The CN would be designed to inhibit this
redundancy. If the two Omega networks communicate only at the ports (a
version that was simulated on the CN simulator), we use a
diagnostic control that disables either the upper or the lower Omega
while the other one is exercised.

If  the  two  Omega  networks  allow  paths  to  be  connected  between 
upper  and  lower  network  at  each  pair  of  nodes,  then  diagnostic 
disable/enable  controls  are  needed  on  both  Omegas  at  all  ten  node 
levels,  twenty  such  signals  in  all,  so  that  at  each  node  level  one 
can force a path to stay in the same (upper or lower) Omega
network, or force it to jump (from upper to lower or vice versa).
With  these  controls,  all  paths  can  be  exercised  under  the  same 
diagnostic  scheme  as  described  for  a single  Omega. 

The error detection used by the diagnostics is the SECDED check on
words that have been sent through the CN in one direction or the
other, and the parity check on addresses and commands sent from
processors to EM. Now a given SECDED error could be due to an
error on the path to EM during a write, or due to an error in the
EM module itself, or could be due to an error on the path from EM
to processor. The diagnostics must distinguish between these
several cases. A test on EM module M consists of writing into the
... possibility of faults in the CN). When the memory is checked out,
the diagnostics can tell the two directions in the CN by sending
data between the EM module and several different processors. To
make sure the failure is not a write failure if the read appears
to fail, write commands to the EM would be generated from several
processors. Likewise, redundant reading is used to check for
write failures.

The diagnostics must detect the case in which the EM port number is
erroneously interpreted in an EM access request, so that the
fault also causes the request to traverse the wrong path. This case is
detected by the parity check at the EM module, which covers the module
number as well as the address and opcode.

6.1.10.3.3 Diagnostic Generation Scheduling. The schedule for
creating diagnostics is constrained by the requirements of
fabrication, debugging, and system integration. The first facility
needed is the test equipment, with enough of the test equipment
software written to facilitate manufacturing acceptance testing of
the LRUs as they are built. The first LRUs built are the processor
and the DC boards. Fabrication generally follows the same
sequence as the diagnostics: the DC is completed first, the
coordinator is completed before the last processor is plugged in,
EM integration (including the CN) follows successful processor
operation, and DBM is the last item to integrate. However,
because of the number of processors involved, processors must be
among the first components fabricated. On this basis, we see that
the sequence of creation of the diagnostics is:

1. Tester and tester programs start first. The first tester
programs written are for DC boards and processor.

2. Processor on-line diagnostics to run on the processor
while the processor is on the tester. This is an early
version of the same processor diagnostic test used for FMP
automatic diagnostics.

3. The PAL assembler. This is used by the debugging logicians
to generate tests on the spot as they debug the
coordinator, the fanout boards, and the DBM controller.

4. On-line diagnostic tests are used to verify proper design
and operation of the FMP. The on-line tests are used as part
of the acceptance tests.


6.2 RELIABILITY, AVAILABILITY AND MAINTAINABILITY

The  efforts  in  reliability,  availability  and  maintainability 
during  this  study  addressed  the  following  key  areas: 

• The effects of redundancy and parts quality on the
FMP reliability

• An updated and refined reliability and availability
analysis of the FMP and the NASF system

• An estimate of the maintenance manpower required to
support the FMP

The redundancy study showed that the use of redundant processors
and extended memory modules and redundancy in the data base memory
provided significant improvements in FMP availability and especially
mean up time (MUT). The level of redundancy studied is now
incorporated in the FMP architecture presented in Chapter 5. The
use of B-2 quality level components versus C level quality was
also shown to make a significant improvement (B-2 and C level
component quality represent levels of quality resulting from different
degrees of testing and screening, discussed in more detail
in Section 6.2.3). The refined predictions of FMP (and NASF) reliability
are based on the incorporation of these conclusions. The
results of the refined reliability analysis of the NASF are summarized
in Table 6.3, which presents mean up time, mean down time and
availability of the three major subsystems of the NASF.

The refined reliability analysis of the FMP considered a range of
failure rates for the LSI memories, as well as a range of improvement
resulting from the application of SECDED, a range of intermittent
or "soft" failure rates, and a range of efficiencies for
recovery from interruptions. The results of this analysis provide
three levels of reliability for the FMP: a lower bound (or worst
probable case), a probable case and an upper bound (or best
probable case). The results shown in Table 6.3 are for the
probable case.


TABLE 6.3

NASF AVAILABILITY ANALYSIS

                             FILE          SUPPORT
                             MANAGEMENT    PROCESSOR
                 FMP         SUBSYSTEM     SUBSYSTEM     COMPOSITE

MEAN UP TIME     14.9 HRS    19,310 HRS    263.0 HRS     14.1 HRS
MEAN DOWNTIME    0.14 HRS    1.9 HRS       0.68 HRS      0.17 HRS
AVAILABILITY     0.9904      .9999         .9974         .9880
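
As a rough cross-check of the composite column in Table 6.3, the sketch below assumes that the three subsystems form a simple series system (failure of any one brings the NASF down) with independent, constant failure rates, and that availability is MUT / (MUT + MDT). These are textbook assumptions and not necessarily the model implemented in the programs of Appendix D; the function names are illustrative only.

```python
def series_mut(muts):
    """Mean up time of a series system: failure rates (1/MUT) add."""
    return 1.0 / sum(1.0 / m for m in muts)


def availability(mut, mdt):
    """Steady-state availability from mean up time and mean down time."""
    return mut / (mut + mdt)


# Mean up times (hours) of the FMP, file management subsystem and
# support processor subsystem, as listed in Table 6.3.
subsystem_muts = [14.9, 19310.0, 263.0]

composite_mut = series_mut(subsystem_muts)
# Composite MUT and MDT taken directly from Table 6.3.
composite_avail = availability(14.1, 0.17)

print(f"composite MUT = {composite_mut:.1f} hours")
print(f"composite A   = {composite_avail:.4f}")
```

Under these assumptions the composite mean up time comes out at about 14.1 hours, and the composite availability agrees with the tabulated value to within about one unit in the fourth decimal place. The calculation also shows that the FMP dominates the composite figure.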

An examination of the FMP reliability analysis results reveals
that the connection network (with no redundancy) has a significant
impact on the FMP failure rate. Similarly the reliability of the
data base memory becomes a determining factor in the FMP reliability
if only the lowest probable SECDED improvement factors are
achieved and the failure rates of the LSI memory circuits (256K)
are no better than that assumed for the worst probable case.
Future efforts regarding the FMP should give careful consideration
to these two areas to achieve optimum FMP reliability.

The  maintenance  analysis  revealed  that  to  have  a 95%  confidence  of 
meeting  the  required  repair  and  maintenance  actions  of  the  FMP  for 
any  given  seven  day  week,  a minimum  of  13  maintenance  personnel 
working  five  shifts  each  must  be  available  (65  8-hour  shifts). 
Assuming  21  shifts  per  week  (3  shifts  per  day  x 7 days  per  week), 
the  maintenance  of  the  FMP  will  require  an  average  of  3 persons 
per  shift,  excluding  operators,  administrative  and  supervisory 
personnel. 
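
The staffing arithmetic in the paragraph above can be reproduced directly. The sketch uses only the figures quoted in the text; the variable names are ours.

```python
personnel = 13            # minimum maintenance staff from the analysis
shifts_per_person = 5     # 8-hour shifts worked by each person per week
shifts_per_week = 3 * 7   # 3 shifts per day x 7 days per week

person_shifts = personnel * shifts_per_person          # total 8-hour shifts
average_per_shift = person_shifts / shifts_per_week    # staff on duty, on average

print(person_shifts, round(average_per_shift, 1))
```

The quotient is slightly above 3, which the text rounds to an average of 3 persons per shift.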

6.2.1 Reliability/Availability Model

A generalized systems model for predicting the reliability and
availability for a computer system includes many elements. Figure
6.3 describes this general model for NASF. There are five major
elements: facility, personnel, software, hardware and miscellaneous.
While all of these elements impact the ultimate system
availability, the analysis and predictions conducted at this time
consider only the hardware and some interruptions of a "soft" or
intermittent nature contributed by the other elements.

This model, as well as the reliability block diagrams of NASF
elements presented later in this chapter, illustrates the
interdependency of the subelements that contribute to the reliability
of the system under consideration.

Figure 6.3 General Reliability/Availability Systems Model for NASF

The NASF hardware includes five elements: (1) the FMP; (2) the file
management subsystem; (3) the support processor subsystem; (4) the
data communication subsystem; and (5) the test and maintenance
equipment. The data communication subsystem consists of a large
number (over 100) of terminal interfaces, modems, networks and I/O
devices. The data communication processors are included in the
support processor subsystem. Failure of any one of the devices or
interfaces in the data communication subsystem has no impact on
the availability of the NASF for the other devices and does not
impact the availability of the rest of the system. Therefore the
data communication subsystem portion of the NASF hardware was
excluded from the study.

The availability of the test and maintenance equipment can be
adjusted to a level, through the use of redundant equipment, that
will have little impact on the overall system availability. The
remaining hardware elements of the NASF, (1) the FMP, (2) the file
management subsystem and (3) the support processor subsystem, are
addressed in this analysis. Detailed models (reliability block
diagrams) of each of these elements are provided later in this
chapter.

Programs developed by the Burroughs Corporation to aid in designing
fault-tolerant computers were used with the above models to
determine the system/subsystem reliability/availability/maintainability.
Details of these programs and definitions of terms are
included in Appendix D.

6.2.2  Redundancy  Study 

The FMP architecture consists of parallel elements in a number of
areas. Parallelism readily permits the use of redundancy for
improving availability. Redundancy, however, can also increase
equipment and maintenance cost, increase the failure rate and frequently
increase software and operations complexity. An analysis was
conducted to compare the effects of the application of redundancy
to the FMP in three areas: (1) the processors, (2) extended
memory, and (3) data base memory. These areas represent three of
the major areas of the FMP.

Calculations of the mean up time (MUT), mean down time (MDT), and
availability (A) of the FMP were made with various combinations
of redundancy. The level of redundancy used is that discussed in
Chapter 5. This includes 4 on-line spare processors, resulting in a
redundancy of 128 required out of 129 processors available in each
of 4 processor bays; 4 on-line spare extended memory modules, resulting
in 130 out of 131 extended memory modules for 3 of the 4 extended
memory bays and 131 out of 132 extended memory modules in the
fourth bay; and a partitioning of the data base memory into 4
sections of which any 2 are required for the FMP to be available.
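
The "R required out of N available" redundancy described above is commonly evaluated with a binomial model. The sketch below assumes identical, independent units that all have the same availability. The unit availability used in the example is a made-up illustrative figure, not a value from this report, and the report's own calculations (Appendix D) may rest on a different model.

```python
from math import comb


def k_of_n_availability(required, available, unit_availability):
    """Probability that at least `required` of `available` units are up."""
    a = unit_availability
    return sum(
        comb(available, up) * a**up * (1.0 - a) ** (available - up)
        for up in range(required, available + 1)
    )


unit_a = 0.999  # hypothetical availability of a single processor

no_spare = k_of_n_availability(128, 128, unit_a)    # all 128 must work
one_spare = k_of_n_availability(128, 129, unit_a)   # 128 of 129 in a bay

print(f"128 of 128: {no_spare:.4f}")
print(f"128 of 129: {one_spare:.4f}")
```

Even a single spare per bay raises the bay availability noticeably in this simplified model, which is consistent in direction with the sensitivity results reported in this section.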

Table 6.4 shows the results of this reliability and availability
analysis for eight different combinations of redundancy (listed as
Cases 1 through 8). The power of redundancy in improving mean up
time and availability can be seen by reviewing these results, which
are summarized in Figure 6.4. The use of redundancy in only one
area makes only a modest improvement in the mean up time and
availability. Use of redundancy in two areas increases the mean
up time and availability somewhat more. The use of redundancy in
all three areas results in a significantly higher mean up time and
availability.


TABLE 6.4

Effect of Redundant Elements on FMP Reliability

             REDUNDANT ELEMENTS
                    EXTENDED   DATA BASE   MEAN UP TIME   MEAN DOWN TIME
CASE   PROCESSORS   MEMORY     MEMORY      (HOURS)        (HOURS)          AVAILABILITY

1      NO           NO         NO          10.2           0.65             .9403
2      YES          NO         NO          15.6           0.43             .9730
3      NO           YES        NO          12.7           0.25             .9449
4      NO           NO         YES         16.2           0.72             .9575
5      YES          YES        NO          22.3           0.51             .9878
6      YES          NO         YES         36.2           0.33             .9908
7      NO           YES        YES         23.5           0.92             .9622
8      YES          YES        YES         117.9          0.51             .9956

It  should  be  pointed  out  that  the  data  base  used  for  these 
calculations  does  not  include  all  the  factors  used  in  the  analysis 
reported  elsewhere  in  this  chapter.  The  results  presented  in  this 
section  should  only  be  used  for  ascertaining  the  sensitivity  of 
the  FMP  reliability  and  availability  to  redundancy. 

The conclusion of this study was that the application of redundancy
to these three areas, to the extent defined, represents a
significant improvement in FMP reliability and availability. The
predicted  reliability  and  availability  values  for  the  FMP  and  NASF 
presented  in  this  chapter  are  predicated  on  the  use  of  this 
redundancy. 

Figure 6.4 Effects of Redundancy on FMP Mean Up Time


6.2.3 Component Quality Study

Various levels of component quality are available for fabricating
electronic systems. These levels are achieved through the application
of certain screening and testing procedures as called out in
various government specifications and statements. Levels most
likely to be considered for the FMP are B-2 and C. The B-2 level
represents the vendor's equivalent of a number of these screening
and testing procedures, including a 168 hour burn-in. The C level
has less stringent tests and no burn-in but is done specifically to
the government specifications (see reference [9] for more
information).

The effect on FMP reliability of using B-2 level versus C level
quality components was investigated. Table 6.5 presents these
results. Four cases were analyzed. The quality level was varied
for the FMP with no redundancy and with the level of redundancy
presented in Section 6.2.2 above. It is noted that the higher the
mean up time, the greater the impact of component quality. The
conclusion of this study is that if a high reliability in terms of
mean up time is desired, higher quality (B-2 level) components
should be used. The predictions of the FMP and NASF reliability
and availability presented later in this chapter are predicated on
the use of B-2 level quality components.


Table 6.5

Effects of Component Quality on FMP Reliability

       QUALITY                MEAN UP TIME   MEAN DOWN TIME
CASE   LEVEL     REDUNDANCY   (HOURS)        (HOURS)          AVAILABILITY

1      B-2       NONE         10.2           0.65             .9403
2      C         NONE         8.9            0.68             .9296
3      B-2       YES          117.9          0.51             .9956
4      C         YES          75.0           0.51             .9932

6.2.4 FMP Reliability and Availability Prediction

Since the FMP is the most complex element of the NASF hardware and
since the concept under consideration involves highly state-of-the-art
technologies, a more detailed analysis has been conducted on
this element. As described earlier, a number of factors have been
considered in the FMP analysis. The values of these factors are
varied over a range to provide an upper and lower bound as well as
probable values for the reliability and availability. Figure 6.5
shows the reliability/availability block diagram used for the FMP.
In addition to the redundancy shown, a B-2 quality level (in
accordance with MIL-HDBK-217B [9]) was assumed for the integrated
circuits and a 6-minute recovery time was assumed for manual operator
restart. The mean times to repair (MTTR) are based on past experience
and the estimated complexity of isolating and correcting
a failure in the various elements.

Figure 6.5 points out the major redundant elements of the FMP. It
should be noted that no redundancy is shown in the connection
network. The reliability/availability analysis assumes a single
layer network. The connection network presented in Chapter 5 is a
double layer network. A double layer provides some unknown level of
redundancy, since one of the purposes of the double layer is to
provide alternate paths where blocking occurs between the processors
and extended memory. At least some failures will appear
as blocking to the network. Therefore some degree of redundancy
(or fault tolerance) is available in the double layered network.
Since the degree of redundancy from the double layer network
cannot be identified and taken into consideration in these
analyses, a single layer network and the component count of a
single layered network were assumed.

The failure rates of the individual FMP elements were determined
by using a tentative parts list for each element. The quantity
and failure rate for each component are applied to straightforward
calculations which result in the element failure rate (or
mean time between failures). Appendix E contains the figures
listing the data and the resulting element failure rates. The
failure rates of these elements and their estimated mean times to
repair are then used with the DESIGN Program, described in
Appendix D, along with other factors to be described, to predict
the FMP reliability and availability.
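
The calculations mentioned above are presumably of the parts-count kind: each component type contributes its quantity times its failure rate, and the element MTBF is the reciprocal of the total. The following is a minimal sketch under that assumption, with constant failure rates expressed in failures per million hours (FPMH). The parts list is invented purely for illustration and is not taken from Appendix E.

```python
def element_failure_rate(parts):
    """Sum quantity x failure rate (FPMH) over a parts list."""
    return sum(quantity * fpmh for quantity, fpmh in parts.values())


def mtbf_hours(failure_rate_fpmh):
    """Mean time between failures in hours for a rate given in FPMH."""
    return 1.0e6 / failure_rate_fpmh


# Hypothetical parts list: name -> (quantity, failure rate in FPMH)
example_parts = {
    "logic IC": (200, 0.05),
    "memory IC": (64, 0.32),
    "connector": (4, 0.10),
}

rate = element_failure_rate(example_parts)
print(f"element failure rate = {rate:.2f} FPMH")
print(f"element MTBF         = {mtbf_hours(rate):,.0f} hours")
```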

Not all of the factors that impact the reliability and availability
of the FMP can be readily delineated. Four factors were
selected for which a range of values could be projected and used
for the FMP reliability predictions. These four factors, which are
discussed in the following sections, are:

(1)  LSI  Memory  Failure  Rate 

(2)  SECDED  Improvement  Factors 

(3)  Ratio  of  Permanent  Failures  to  Intermittent  Failures 

(4)  Recovery  Efficiency 

6.2.4.1 LSI Memory Failure Rates

Actual field data on LSI memory failure rates is relatively
sparse. Some data is available on 16K devices [8]. Reliability
models such as those in MIL-HDBK-217B [9] for predicting device
failures generally do not hold for significant increases in complexity
and density. A worst probable case (lower bound) failure
rate may be assumed by using the conservative estimate of .4
failure per million hours (FPMH) for a 16K device, which is equivalent
to the failure rate of four 4K devices (which have an
accepted failure rate of .1 FPMH). Using this same philosophy,
lower bound failure rates for the 64K and 256K devices were set at 1.6
FPMH and 6.4 FPMH respectively.

Figure 6.5 FMP Reliability/Availability Block Diagram

For an upper bound (best probable case), a value of .1 FPMH was
set for all three memory devices. Curves showing the improvement
of MOS memory device failure rates with maturity tend to be
asymptotic to a value in the range of .1 FPMH regardless of the
density [8].

The most probable failure rate was determined using the model in
MIL-HDBK-217B for the 16K device and then doubling that value for
each quadrupling of memory size. This process results in failure
rates of .32 FPMH for the 16K device, .64 FPMH for the 64K device
and 1.28 FPMH for the 256K device.
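
The three failure-rate assumptions in this section follow simple scaling rules, sketched below. The constants are the values quoted in the text; the function names and the way the rules are parameterized by device size in kilobits are our own illustration.

```python
from math import log

BASE_4K_FPMH = 0.1         # accepted failure rate of a 4K device
PROBABLE_16K_FPMH = 0.32   # most probable rate for a 16K device (from the text)
BEST_CASE_FPMH = 0.1       # mature-device asymptote, independent of density


def worst_case(kbits):
    """Worst probable case: rate scales with the number of 4K blocks."""
    return BASE_4K_FPMH * (kbits / 4)


def most_probable(kbits):
    """Most probable case: rate doubles for each quadrupling beyond 16K."""
    quadruplings = round(log(kbits / 16, 4))
    return PROBABLE_16K_FPMH * 2 ** quadruplings


for kbits in (16, 64, 256):
    print(kbits, round(worst_case(kbits), 2),
          round(most_probable(kbits), 2), BEST_CASE_FPMH)
```

Note that in the report's terminology the "lower bound" is the worst probable case (highest failure rate) and the "upper bound" is the best probable case.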


6.2.4.2 SECDED Improvement Factor

Improvements in reliability of the FMP are made through the application
of Single-Bit Error Detection and Correction and Double-Bit
Error Detection (SECDED) in the FMP memories. The mathematical
model discussed in Appendix B of reference [2] determined that
gains could vary from a lower bound (worst probable case) of 2 to
an upper bound (best probable case) of 164 for 16K, 327 for 64K
and 653 for 256K memory packages. These two bounds represent the
extremes of the probable SECDED improvement. It is anticipated
that the real value will fall somewhere within this range. For
the purpose of this analysis a value of 50 has been selected as
being the most probable SECDED improvement factor.

The SECDED improvement factor is applied to the reliability analysis
by direct division of the memory device failure rates by the
improvement factor. Note, however, that application of the improvement
factor to the memory circuits alone does not consider
that SECDED also corrects transient errors that may occur from
other sources. For example, transient single bit errors occurring
in the connection network, or due to software errors or
noise problems in data being transmitted to a memory, may be corrected
through SECDED.
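
The direct division described above can be illustrated for the 256K device, using the failure rates of Section 6.2.4.1 and the improvement factors quoted in this section. The pairing of rates and factors into three cases follows the worst, most probable and best cases of the text; the function name is illustrative.

```python
def effective_rate(device_fpmh, secded_factor):
    """Memory failure rate after dividing by the SECDED improvement factor."""
    return device_fpmh / secded_factor


# 256K device: (failure rate in FPMH, SECDED improvement factor)
cases = {
    "worst probable": (6.4, 2),
    "most probable": (1.28, 50),
    "best probable": (0.1, 653),
}

for name, (rate, factor) in cases.items():
    print(f"{name:>14}: {effective_rate(rate, factor):.5f} FPMH")
```

The spread between the worst and best cases is several orders of magnitude, which is why the data base memory result is so sensitive to these two assumptions.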

6.2.4.3 Ratio of Permanent Failures to Intermittent Failures

Burroughs field data has shown that the ratio of the mean time
between permanent failures (MTBF(P)) to the mean time between
intermittent failures (MTBF(I)) is estimated to vary over the
range of about 10 to 1 to 1 to 1. These values have been selected
for the lower and upper bound and the ratio of 5 to 1 selected for
the most probable case. The value of 5 to 1 for
MTBF(P)/MTBF(I) corresponds to the assumption that 5 out of 6
failures are due to intermittents.
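
The correspondence between the MTBF ratio and the share of intermittent failures follows from the fact that failure rates are the reciprocals of the MTBFs. A short sketch (the function name is ours):

```python
def intermittent_fraction(mtbf_ratio):
    """Fraction of failures that are intermittent, given MTBF(P)/MTBF(I)."""
    # With rate = 1/MTBF, the intermittent failure rate is `mtbf_ratio`
    # times the permanent failure rate.
    return mtbf_ratio / (mtbf_ratio + 1.0)


for ratio in (10, 5, 1):
    print(f"{ratio}:1 -> {intermittent_fraction(ratio):.3f}")
```

For the 5 to 1 ratio this gives 5/6, as stated in the text; the 10 to 1 and 1 to 1 bounds correspond to 10/11 and 1/2 respectively.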

6.2.4.4 Recovery Efficiency

The FMP, like other large systems, should be able to automatically
recover from intermittent failures and in some cases permanent
failures. The system recovery should be designed with the goal of
being 100% efficient, that is to say that 100% of the time after
an interruption the system automatically reconfigures and
restarts with negligible down time. Unfortunately, most systems do
not achieve this idealized goal. Experience shows that recovery
efficiency varies, ranging from 70% to almost 100%.
These levels (70% and 100%) were selected as the lower and upper
bounds and a level of 30% selected for the predicted level.
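
One simple way to see why recovery efficiency matters: if a fraction RE of interruptions is recovered automatically with negligible down time, only the remaining (1 - RE) lead to a manual restart. The sketch below assumes exactly that simplified picture, which is not necessarily how the DESIGN program of Appendix D treats recovery; the 10-hour interruption interval is a hypothetical figure chosen for illustration.

```python
def mean_time_between_manual_restarts(mtbi_hours, recovery_efficiency):
    """Mean time between interruptions that are NOT recovered automatically."""
    unrecovered = 1.0 - recovery_efficiency
    if unrecovered <= 0.0:
        return float("inf")  # every interruption is recovered automatically
    return mtbi_hours / unrecovered


mtbi = 10.0  # hypothetical mean time between interruptions, in hours

for efficiency in (0.70, 1.00):  # lower and upper bounds quoted above
    print(efficiency, mean_time_between_manual_restarts(mtbi, efficiency))
```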

6.2.4.5 FMP Reliability Analysis Results

The values of the various factors discussed above were used as
inputs to the FMP model and reliability analysis program. Figures
6.6, 6.7 and 6.8 present the input data and the calculated results
of this analysis for the lower bound (worst probable case), the
probable case and the upper bound (best probable case). The input
data include the following:


 1) Name:     Abbreviated name of an FMP element
 2) R:        Minimum number of elements required for FMP to be available
 3) N:        Number of identical elements available
 4) MTBF(P):  Mean time between permanent failures
 5) MTBF(I):  Mean time between intermittent failures
 6) SPFM:     Single point failures (not used in this analysis)
 7) DRT:      Device repair time
 8) SRT:      Single point repair time (not used in this analysis)
 9) RE(P):    Recovery efficiency from permanent type failures
10) RE(I):    Recovery efficiency from intermittent type failures
11) DMRT:     Device manual recovery time (assumed to be .1 hours for the FMP)
12) MTBME:    Mean time between maintenance errors (not used in this analysis)
13) MTBPM:    Mean time between maintenance actions (not used in this analysis)
14) MTTPM:    Mean time to perform preventive maintenance (not used in this analysis)

The  output  data  consist  of  the  following  three  items: 

1) MUT:    Mean up time
2) MRT:    Mean repair time (which for the system being analyzed will be the same as the mean down time (MDT))
3) Avail:  Availability - Percent of time that system or required elements are available for use

Unless otherwise noted, all times are expressed in hours. More
discussion on these terms can be found in Appendix D.

Examination of this data shows that the two major areas having
greatest potential impact on the FMP reliability are the connection
network (CN) and the data base memory (DBM). The connection
network, which for this analysis is assumed to have no redundancy,
has the lowest MUT for both the most probable case and the best
probable case. In the worst probable case the connection network
has the second lowest MUT, the lowest MUT being that of the data
base memory (DBM). Two factors contribute to the low MUT for the
DBM in this case: the failure rate of 6.4 FPMH, and a SECDED
improvement factor of only 2.

Conclusions  from  this  analysis  indicated  that  redundancy  should  be 
implemented  for  the  connection  network.  Furthermore,  special 
attention  should  be  paid  to  the  design  and  application  of  SECDED 
to  the  data  base  memory  and  in  obtaining  LSI  memory  circuits  with 
a failure  rate  significantly  less  than  6.4  FPMH. 

Table  6.6  summarises  the  reliability  analysis  results  for  the  FMP 
and  shows  the  values  of  the  factors  considered  in  the  different 
cases. 

6.2.5  Support  Processor  and  File  Management  Subsystems 

Figures 6.9 and 6.10 show the reliability block diagrams of the
support processor and file management subsystems. The high level
of redundancy in these systems contributes significantly to their
overall reliability. For the purpose of this analysis, the
failure rates of the individual models include hard and intermittent
failures. The failure rate data used for the support
processor elements and the disk packs and file control elements of
the file management subsystem are obtained from current field
experience on similar systems. The equipment that might be used
in the 1980's, though faster and of greater capacity than that in
the field now, is expected to have reliabilities and
availabilities that will equal or exceed those of these current
systems.

Figure 6.11 lists the data used for the support processor subsystem.
Computed outputs for the mean up time (MUT), mean down time (MDT),
availability and the mean time between interruptions for the individual
items and the total support processor subsystem are shown
in Figure 6.12. The CONFIGURE program described in Appendix D was
used to generate these results. An example of actual field data
of a similar system is provided for comparison in Figure 6.13.

Figure 6.9 Support Processor Subsystem Reliability/Availability
Block Diagram

Figure 6.10 File Management Subsystem Reliability/Availability
Block Diagram

Figure 6.11 Support Processor Reliability Data

Figure 6.13 Example of Actual Field Data on a Large
System Similar to the Support Processor


Table 6.7 lists the data used for the reliability/availability
analysis of the file management subsystem and the results of this
analysis. The MUT and MDT used for the mass memory are based on
design specifications.

Table 6.7

File Management Subsystem Reliability Data and Analysis Results

ELEMENT 

1 - 

N 

. MUT  (HRS) 

MDT (HRS) 

A 

File  Control 

f 

^ 1 

- 

L.  . 

19,310 

I 

Disk  Packs 

3 

6 

250 

■ 1.0 

Mass  Memory 

1 1 

1 

5,246 

) 

1.8 

System  Total 

i 

i 

19,310 

1.9 

.9999 

R = Required number of elements
N = Number Available
MUT = Mean Up Time
MDT = Mean Down Time
A = Availability

(Data from experience on similar equipment or design
specifications)

6.2.6  Maintenance 

6.2.6.1 Maintenance Philosophy

Maintenance of the FMP should be based on a remove and
replace-with-spare philosophy at the lowest replaceable unit (LRU)
level as determined by the maintenance analysis. Repair of the
replaced failed items would be off-line using subassembly testers
available at the site. The FMP should be equipped with fault
detection circuits that, in conjunction with system confidence
checks and diagnostics, would provide indications of an existing
problem via a printout or status display. Errors can be logged
automatically, giving appropriate file information for isolation of
failure(s). Upon detecting a fault, the maintenance personnel can
initiate the isolation action required (hardware/software/manual
diagnostics) to locate the fault to the malfunctioning subassembly
within an element or LRU. A replacement subassembly will be withdrawn
from spares and substituted in the FMP element. Before
restoring the element to an active status, a confidence check
would be performed to determine if the failure has been corrected.
The malfunctioning subassembly can then be forwarded to the appropriate
repair facility (site, depot or factory). Upon repair at
the site, the LRU can be returned to the spares stock.

The remove and replace philosophy requires that adequate spares be
stocked on-site to preclude degradation of the FMP performance
parameters (MUT, MDT, Availability). The actual quantity and
types of spares required and the lead times should be determined
from their actual usage in the equipment and their individual
failure rates.

Preventive maintenance of the FMP consists of periodic testing of
the power supplies, checking of rotating memories and general
housecleaning. This effort would be minimal, and most of it can
be accomplished on-line.

6.2.6.2 Maintenance Plan

Upon detection of a failure the system diagnostic can be automatically
initiated to determine the malfunctioning element. If that
element is redundant, the system automatically reconfigures under
program control, replacing the malfunctioning element. If the failed
element is not redundant, maintenance diagnostics can be initiated
to isolate the failure to the malfunctioning subassembly for removal
and replacement by a spare, and the process is manually restarted.
The design approach being investigated would allow
removal/replacement of redundant modules with power on. This
approach would tend to reduce the equipment downtime by allowing
more rapid access to the failed items.

Since SECDED is applied to the memories, most single-bit errors
will not cause any equipment failure. When the log shows that a
single bit is stuck, the system could be shut down when desired in
an orderly fashion for maintenance action. This feature would
minimize the loss of productive time. The information stored
in the log could then be processed on an as-called basis for
location of the failure or error. The system diagnostic would
utilize the following means for error detection and error
correction:

a) Processor Module

   • Parity check on Microprogram Memory
   • Reasonableness checks (See Appendix C for detailed list)

b) Data Base Memory

   • Error correction with logging of errors for detecting repeated faults

c) Connection Network

   • Error correcting codes as part of the data plus parity checks on address and instructions to memory

d) Coordinator

   • SECDED in the memory
   • Reasonableness checks (See Appendix C)

e) Memories

   • SECDED

f) Power Supplies

   • Detection of over voltage on input line will cause the FMP to automatically shut down to prevent damage to the equipment
   • Detection of voltage out of range on output


6.2.6.3 Personnel Support Requirements

Detection, isolation, repair and checkout of a failure in the NASF
System require individuals with knowledge of, and experience with,
digital processing equipment. These individuals should have a
thorough understanding of electronic principles, systems logic and
solid state component operation as applicable to high speed
digital data processing equipment. They should also have a
thorough understanding of electronic test equipment operation and
be able to read schematics, logic and wiring diagrams, and
blueprints. Their background should include, at a minimum, a high
school education and training in an advanced electronics, digital
data processing and computer maintenance course. Maintenance
personnel should possess experience in the installation, repair,
overhaul and modification of high speed digital data processing
systems and be familiar with the test equipment applications
associated with the accomplishment of these tasks.

An analysis has been conducted to ascertain the level of manpower
required to provide for repair and maintenance of the FMP. The
results of this analysis show that, to have 95% confidence of
meeting the required actions within the times allocated, a minimum
of 13 maintenance personnel, each working 5 shifts per week, are
required.

Estimates of the personnel support (labor hours) requirements for
the NASF System provided in this section assume the type of
maintenance personnel described above. The estimates are based on
95% upper confidence bounds applied to element failure rates and
the weighted average repair time of a given subsystem. (An upper
confidence bound of 95% applied to the element failure rates means
that for 95% of the time the failures of a given element will be
within this bound.) The basic steps followed to determine these
estimates are:

a.  Determine the average number of failures expected in
    a 168 hour operational week for a given element in
    a given subsystem.

b.  Determine the expected number of failures at the
    95% upper confidence bound for each element and the
    corresponding subsystem total.

c.  Determine the weighted average equipment repair time
    for the given subsystem.

d.  Determine the labor hours expected at a 95% confidence
    level to be expended in performing corrective
    maintenance (CM) (on-line localization, isolation,
    LRU removal and replacement).

e.  Estimate the labor hours for performing preventive
    maintenance (PM).

f.  Estimate the labor hours for LRU repair off-line
    (bench repair).

g.  Estimate the total labor hours required per shift.


Steps  a and  b 


Table 6.8 shows the average number of repair actions expected
weekly as computed for each equipment in the FMP, File and Support
Processor subsystems. The weekly (168 hours) period was chosen
because it best satisfies operational conditions. The smallest
value of n_j that satisfies the Poisson formula condition given in
equation 6.1 determines the maximum number of repair actions at
95% confidence for the jth element.


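$$
\sum_{i=0}^{n_j} \frac{m_j^{\,i} \, e^{-m_j}}{i!} \;\ge\; 0.95 \qquad (6.1)
$$

where

$$
m_j = \frac{\mathrm{Qty}(j) \times t}{\mathrm{MTBF}(j)}
$$

is the average number of repair actions for the jth element in time t.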
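The search for the smallest n_j can be sketched as follows. This is a minimal illustration, assuming the condition in equation 6.1 is that the cumulative Poisson probability reaches 0.95; the function names and the element values (512 units, 100,000 hour MTBF) are hypothetical and are not taken from Table 6.8.

```python
import math

def mean_repair_actions(qty, mtbf_hours, t_hours=168.0):
    """m_j = Qty(j) * t / MTBF(j): average repair actions in time t."""
    return qty * t_hours / mtbf_hours

def poisson_upper_bound(m, confidence=0.95):
    """Smallest n with P(X <= n) >= confidence for X ~ Poisson(m)."""
    term = math.exp(-m)   # P(X = 0)
    cdf = term
    n = 0
    while cdf < confidence:
        n += 1
        term *= m / n     # P(X = n) obtained from P(X = n - 1)
        cdf += term
    return n

# Hypothetical element, for illustration only.
m = mean_repair_actions(512, 100_000.0)
print(round(m, 3), poisson_upper_bound(m))
```

Note that the subsystem bound quoted in the text is the sum of the per-element bounds, so it is larger than the bound obtained by applying the same search to the subsystem average.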


Table 6.8 shows the input values of Qty(j) and MTBF(j) for each of
the j elements and the calculated values m_j and n_j for t=168
hours. The subsystem total for the FMP subsystem shows an average
of about 11.5 repair actions per week, but at a 95% confidence
level there will be no more than 33 repair actions per week. The
corresponding values for the file and support processor subsystems
are about 12 and 37 repair actions per week for the average and
95% confidence bound respectively.

Step c

Determine the weighted average equipment repair time for the given
subsystem.


The mean time to repair for each of the j elements is tabulated in
Table 6.8 as MTTR(j). The mean time to repair a subsystem, MTTR,
is obtained as a weighted average of the MTTR(j). The weighting is
proportional to the quantity, Qty(j), and inversely proportional to
the mean time between failures, MTBF(j), of the jth element, since
these determine the frequency with which the jth element comes up
for repair. The appropriate formula for the subsystem mean time to
repair, MTTR, for corrective maintenance is:


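$$
\mathrm{MTTR} = \frac{\displaystyle\sum_j \mathrm{MTTR}(j) \times \frac{\mathrm{Qty}(j)}{\mathrm{MTBF}(j)}}{\displaystyle\sum_j \frac{\mathrm{Qty}(j)}{\mathrm{MTBF}(j)}} \qquad (6.2)
$$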


When the values in Table 6.8 are used for j=1 to 10, the weighted
average equipment repair time (for corrective maintenance) for the
FMP is 0.618 hours, and for j=11 to 20, the mean time to repair
for the file and support processor is 2.2 hours.
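The weighted average of equation 6.2 can be sketched as below. The function name and the element values are hypothetical (the Table 6.8 entries are not reproduced here); each element is weighted by its failure frequency Qty(j)/MTBF(j).

```python
def subsystem_mttr(elements):
    """Failure-rate-weighted mean time to repair.

    elements: iterable of (mttr_hours, qty, mtbf_hours) tuples.
    """
    # Weight of each element is its expected failure frequency Qty/MTBF.
    weighted = [(mttr, qty / mtbf) for mttr, qty, mtbf in elements]
    total_rate = sum(rate for _, rate in weighted)
    return sum(mttr * rate for mttr, rate in weighted) / total_rate

# Hypothetical values, not taken from Table 6.8.
elements = [
    (0.5, 512, 100_000.0),   # many units, quick repair
    (2.0, 4, 20_000.0),      # few units, slow repair
]
print(round(subsystem_mttr(elements), 3))
```

An element that fails often pulls the subsystem figure toward its own repair time, which is why a large population of quickly repaired modules can keep the subsystem MTTR low.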


Steps  d thru  g 

A good approximation to the distribution of mean repair time is a
normal distribution. It then follows that the general equation for
the personnel hours, P_CM, expected to be expended on corrective
maintenance at 95% confidence can be expressed as:


$$
P_{CM} = \left( \mathrm{MTTR} + \frac{1.6448501\,\sigma}{\sqrt{n}} \right) n P \qquad (6.3)
$$

where

P = Number of maintenance personnel

n = Number of repair actions

σ = Standard deviation of repair times
  = 0.25 hours (as determined from observed data
    taken on similar equipment)


Table 6.8

Number of Repair Actions Per Week for NASF System Elements

Table 6.9

CORRECTIVE MAINTENANCE LABOR HOUR ESTIMATES

                     No. of         No. of          Labor Estimates at 95%
                     Maintenance    Repair          (Maintenance Personnel
Subsystem            Personnel (P)  Actions/wk (n)  Hours/Wk), (P_CM)
-------------------  -------------  --------------  ----------------------
FMP                  1               9                6.7949
                     2              16               23.0630
                     3               8               18.3193
  Subtotal:                         33               48.1772

Support Processor    1              10               23.3019
and File Systems     2              19               88.6648
                     3               8               56.2929
  Subtotal:                         37              168.2596

Experience shows that 27.9% of all equipment failure corrective
action is performed with one (1) maintenance person, 49.6% with
two (2) maintenance personnel, and 22.5% with three (3)
maintenance personnel. Substitution of these values for P into
equation 6.3 yields the results in Table 6.9.
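As a check on the procedure, the FMP rows of Table 6.9 can be approximated with a short calculation. This is a sketch under two assumptions that are not stated explicitly in the text: that the multiplier in equation 6.3 is the one-sided 95% standard normal quantile (about 1.645), and that it is applied to the standard deviation of the mean repair time, sigma divided by the square root of n. With MTTR rounded to 0.618 hours these assumptions agree with the tabulated FMP values to within about 0.01 hour.

```python
import math
from statistics import NormalDist

# Assumed multiplier: one-sided 95% standard normal quantile (about 1.645).
Z95 = NormalDist().inv_cdf(0.95)

def cm_labor_hours(mttr, sigma, n, p):
    """Personnel hours for n repair actions that each need p people."""
    return (mttr + Z95 * sigma / math.sqrt(n)) * n * p

# FMP rows of Table 6.9 as (personnel P, repair actions n).
fmp_rows = [(1, 9), (2, 16), (3, 8)]
hours = [cm_labor_hours(0.618, 0.25, n, p) for p, n in fmp_rows]
print([round(h, 2) for h in hours], round(sum(hours), 2))
```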

The results shown in Table 6.10 are based on the following
assumptions: (1) Previous field experience for off-line repair
utilizing a subassembly tester indicates a two hour repair time
per equipment failure. (2) The amounts of time selected for
preventive maintenance (PM) are also based on previous field
experience. However, the final time values for PM can be better
determined once the PM procedures are developed.

The final results are adjusted to consider the efficiency of the
personnel. An 80% personnel efficiency is assumed to cover
contingencies such as set-up times, breaks, report writing and
other documentation requirements. These results indicate that,
with 95% confidence, thirteen (13) maintenance personnel, each
working 5 shifts during a 21 shift, seven day week, can adequately
support the NASF Computing System, for an average of about 3
persons per shift.


Additional personnel should be considered to account for time off
and shift rotation within established personnel policies. The
above manning level does not include those personnel required for
supervision, administration, software support, system operation
and maintenance of the data communication displays, terminals and
other I/O equipment.


TABLE 6.10

ESTIMATED NASF MAINTENANCE LABOR REQUIREMENT

                                           Labor Required
                     Maintenance           (Maint. Personnel
Subsystem            Activity              Hours)
-------------------  --------------------  -----------------
FMP                  CM                     48.18
                     PM                     14.00
                     Repair Off-Line        66.00
                     Subtotal:             128.18

File and Support     CM                    168.26
Processor            PM                     28.00
                     Repair Off-Line        74.00
                     Subtotal:             270.26

TOTAL                                      398.44

At 80% Efficiency                          498.05 hours/week
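The step from labor hours to head count can be sketched as follows, using the component entries of Table 6.10. The 8 hour shift length is an assumption derived from 21 shifts in a 168 hour week; the variable names are illustrative.

```python
import math

# Weekly labor content from Table 6.10: CM + PM + off-line repair.
fmp_hours = 48.18 + 14.00 + 66.00
support_hours = 168.26 + 28.00 + 74.00
total_hours = fmp_hours + support_hours

EFFICIENCY = 0.80
SHIFTS_PER_WEEK = 21
HOURS_PER_SHIFT = 168 / SHIFTS_PER_WEEK   # assumed: equal 8 hour shifts
SHIFTS_PER_PERSON = 5

required_hours = total_hours / EFFICIENCY
hours_per_person = HOURS_PER_SHIFT * SHIFTS_PER_PERSON
personnel = math.ceil(required_hours / hours_per_person)
per_shift = personnel * SHIFTS_PER_PERSON / SHIFTS_PER_WEEK

print(round(total_hours, 2), round(required_hours, 2), personnel, round(per_shift, 1))
```

The off-line repair entries follow from the two hour bench repair assumption applied to the 95% repair action counts (2 x 33 and 2 x 37).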


6.2.6.4 Sparing Considerations


An important condition to the acquisition and maintenance of any
system is the philosophy of sparing parts, assemblies, and
subsystems to support the specified system operational
requirements.

Sparing considerations are developed as a result of an overall
logistics support study which takes into account requirements such
as:

- System MUT, MTTR and Availability,
- Redundancy considerations,
- Recovery time of repairables,
- System maintenance philosophy,
- Hardware complexity,
- Corrective/Preventive maintenance skill requirements,
- Site, depot or factory repair,
- Special and standard test equipment or tools required at the
  site, depot or factory locations,
- Storage facilities (space, environment, etc.),
- Distance from spare part supply points,
- Turn-around time for repair on site, depot and factory
  ("Pipeline" time),
- Packaging for long term storage,
- Shelf life,
- Long term availability of discrete parts due to technology
  advances, etc.,
- Cost tradeoff studies of repair at piece part versus
  assembly/subsystem level or throwaway,
- Identification of wear out items replaced at specific
  intervals.

The maintainability characteristics of any system, backed up by the
reliability, availability and performance requirements, determine
the system effectiveness, logistics supportability and the cost of
system maintenance.

As new systems are developed, they become more complex with
respect to the sophistication of new state-of-the-art circuitry and
the application and density of circuitry within equipment
elements.

Complex and large systems generally have inherently low mean up
times; therefore, a viable logistics support plan becomes a prime
factor in the operation of such systems.

As various elements of the NASF system become defined, final part
types, part quantities and categories are ultimately selected, and
circuit packaging is determined, a realistic and comprehensive
logistic support/spares study can be performed on the FMP support
equipment.

Spares are determined through a quantitative analysis which
basically utilizes item failure rates and item population in the
system, and applies various confidence levels to meet the customer
requirements and the established qualitative and quantitative
maintenance requirements. Burroughs maintains computerized
programs for establishing spare part support quantities at the
desired confidence level. These programs can be used to establish
the spare quantities for the FMP once the subsystem becomes better
defined.

Spare parts selected for site maintenance consideration fall into
three basic categories, namely, electronic and mechanical piece
parts, subassemblies, and assemblies classified as site repairable
in accordance with the established maintenance philosophy.


Typical part types in these three categories are:

Piece Parts:    Fuses
                Integrated Circuits
                Diodes
                Resistors
                Capacitors
                Switches
                CRT's
                Connectors
                Pin & Socket contacts
                Indicator lamps & LEDs
                Blowers
                Drive Belts
                Motors
                Misc. parts (wire, etc.)

Subassemblies:  Individual Processors
                Printed Circuit Cards (logic)
                Power Supply regulator cards
                Miscellaneous Printer subassemblies
                Misc. Tape, Disc & Display subassemblies

Assemblies:     Power supplies (main)
                Keyboards
                Miscellaneous Printer assemblies
                Misc. Tape, Disc & Display assemblies


Spare  parts  required  for  depot  or  factory  repair  support  will  be 
dependent  on  specific  items  identified  through  future  support 
efforts.  In  addition  to  sparing  parts  required  for  consumption  at 
the  depot  or  factory  level,  additional  items  will  be  required  to 
maintain  site  levels  of  high  value  and/or  non-reparable  items  for 
maintaining  the  "pipeline"  flow  to  the  site. 


CHAPTER  7 

FMP  TIMING  SIMULATIONS 


7.1  FMP  MODEL 

The FMP Model (Figure 7.1) includes Extended Memory (EM),
Connection Network (CN), Coordinator (CR), and one or more
processors, each including one Execution Unit (EU) and two memory
modules. Synchronizing signals between CR and EU's are modeled, as
are the effects of CN and EM characteristics on EU instruction
times.

The time resolution of the simulation is a single processor clock,
nominally 40 ns for a 25 MHz clock.

The  simulation  is  detailed  to  the  single  processor  and  coordinator 
(CR)  instruction  with  sufficient  accuracy  in  the  models  of  the 
various  functions  to  give  good  estimates  of  the  execution  times  of 
actual  code  samples.  The  detail  required  is  greatest  in  the 
processor,  where  it  nearly  equals  that  to  which  the  design  has 
been  carried.  The  coordinator  (CR)  is  modeled  less  completely, 
but  well  enough  to  model  instruction-level  execution  of  code  with 
reasonable  accuracy. 

The EM and CN are modeled only to the extent that their
performance parameters are accounted for in the timing of the
instructions which use them.

7.1.1  Processor  Model 

Figure 7.2 shows the functions modeled in the processor. The way
these functions perform is best shown by tracing the steps in
executing instructions.

It is necessary in some cases to distinguish between functions
performed by the simulator, which use no simulation time or
resources and are indicated by (S), and the functions which take
time and/or resources and which correspond to actual hardware
functions, indicated by (M).

The simulated code file (S) contains one entry for each
instruction of the actual code modeled. The PCR (S) points to the
next entry of the simulated code, and this entry when fetched (S)
points to a coded description of the instruction (S). This
description is fetched and decoded as soon as the previous
instruction starts executing. The coded information includes:

(1) Instruction length (code space taken)

(2) CR synchronizing action, if any

(3) Resources used (IP, FP, DM, CN buffer/CN)

(4) Time of use of each of the resources

(5) Reporting information, if a floating point arithmetic
    instruction.

Figure 7.1 Flow Model Processor,
Showing Functions Included in Simulation Model

Figure 7.2 Functions Simulated in Processor Model

After fetching and decoding (S) the instruction, the actual
behavior of the processor in fetching, decoding, and executing the
instruction is modeled.

7.1.2  Program  Fetch 

The processor memory is modeled as static memory with three
clocks access time; that is, new data is available at the output
three clocks after a new address is supplied, and remains
available statically so long as the address is held. The actual
program address register is not modeled, but it is assumed that the
next program address is supplied as soon as the previous program
word is fetched to the program stack, so the next program word is
available three clocks later. The initiation of program fetches
is therefore driven by the availability of space in the program
stack (M). The space that was occupied in the program stack by an
instruction (M) (as specified in its description) becomes
available as soon as it starts executing, and when the total space
available in the program stack is enough, a code word (M) is
transferred to it and the next program fetch is initiated.

The above program fetch sequence is subject to two exceptions.
First, when a jump or a conditional branch is executed, the program
stack is marked empty, and the program fetch then in progress is
restarted, so that the new program word is not available for
execution for three clocks. Furthermore, this action itself does
not begin until the branch instruction has been executed. The
latter also applies for a test-and-branch instruction when the
branch is not taken; the next instruction cannot start executing
until the test is completed. Alternatively, the model may be
altered so that program fetch delay occurs when the branch is not
taken, with program execution from the new address continuing
without delay when the branch is taken.

The second exception case for program fetch can occur only when
the model of processor memory is made homogeneous, that is, both
modules are shared between program and data storage. In this case
a data access to one of the modules aborts the program access then
in progress, and the next program word from that module is not
available for two memory cycles (6 clocks). If the data access
and the transfer to program stack are simultaneous, the transfer
is completed without delay, so the maximum program fetch delay
which can be caused by a single data access is five clocks. The
module for data access is selected at random (S). When memory is
used for data fetch or store, it is treated as a single resource,
not accessible when busy, even in the case that both memory
modules may be used for data storage and are independently
accessible. This is because the memory addresses are always
modified by an integer register, so the actual address and thus
the module to be used cannot be known until after the instruction
starts execution.

When memory is modeled as homogeneous, program fetches alternate
between the two modules, but only a single program address
register is assumed, so the program fetches from the two modules
are initiated simultaneously, and the next fetch from the first
module cannot be initiated until the fetch from the second module
is complete.

7.1.3 Instruction Execution

After the instruction is decoded (S), and the resources needed and
the times when their use starts have been determined, the
scoreboard (M) is examined to determine whether the instruction
must be delayed. The scoreboard, updated when each instruction
starts executing, contains the time at which each resource will be
released. If it is found that there will be a delay, further (S)
processing waits until the resources are available. Then the
content of Program Stack is examined, and if the operator
syllable(s) are not present, the instruction queues (S) until the
next program fetch makes the syllables available. When the
instruction starts, if its use of any of the resources is
specified as delayed, then that part of the instruction must wait
in the proper Holding Register (M). If the required Holding
Register is in use by the prior instruction, then the start of
execution is delayed until the holding register is available.
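The scoreboard check described above can be sketched as a small timing model. This is an illustration only, not the simulator's code: the `Use` record, the rule that at most one instruction starts per clock, and the example durations are all assumptions, and program fetch and holding registers are left out.

```python
from dataclasses import dataclass

@dataclass
class Use:
    resource: str   # e.g. "IP", "FP", "DM", "CNB"
    offset: int     # clocks after instruction start when the use begins
    duration: int   # clocks the resource is then held

def issue_times(program):
    """Start clock of each instruction under a simple scoreboard."""
    release = {}     # scoreboard: resource -> clock at which it is released
    starts = []
    earliest = 0     # decode is serial: an instruction cannot pass its predecessor
    for uses in program:
        start = earliest
        for u in uses:
            # Delay the start until the resource is free when the use begins.
            start = max(start, release.get(u.resource, 0) - u.offset)
        for u in uses:
            release[u.resource] = start + u.offset + u.duration
        starts.append(start)
        earliest = start + 1   # assumed: at most one instruction starts per clock
    return starts

# Hypothetical sequence: FP, IP, FP, IP instructions.
program = [[Use("FP", 0, 4)], [Use("IP", 0, 1)], [Use("FP", 0, 4)], [Use("IP", 0, 1)]]
print(issue_times(program))
```

In this sketch the last integer instruction is held behind the second floating point instruction even though the IP is free, which is the loss of overlap that a queueing mechanism would avoid.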

Note the reversal of the actual order of operations: The
instruction cannot actually be decoded until it has been fetched,
and any waiting for resources must follow this. However, we wish to
know how much the execution of code is delayed by program fetch,
so we do not count any fetch time during which the instruction
would have been waiting for resources anyway.

The  actual  processor  probably  would  not  use  a scoreboard  as  above, 
because  this  mechanism  for  controlling  instruction  overlap  is  not 
fully  effective  when  the  execution  time  for  instructions  is  data 
dependent,  as  will  probably  be  the  case  for  arithmetic  operations. 
A mechanism  similar  to  the  holding  registers  would  be  used,  where 
the  various  parts  of  the  instruction  can  wait  for  their  resources. 
An example of the difference in timing in the two cases is shown
in the timing charts of Figure 7.3. Here the instructions using
the Integer Processor (Numbers 2, 3, 5) start much sooner in (b)
than in our model (a). This is because when an instruction enters
the queue, the next instruction is available for decoding, whereas
in (a), when an instruction is delayed by the scoreboard, the
decoding hardware is tied up until the instruction starts. However,
note that in both examples the Floating Processor is busy full
time, and the FP instruction (4) starts at the same time in both.
It is reasonably clear that the queueing mechanism will allow more
instruction overlap in some sequences, giving a reduction in
execution time, but such cases will be uncommon and only a small
reduction in total execution time can be expected.


Figure 7.3
Simple Instruction Timing Diagram, Contrasting Scoreboard and
Queueing Implementations of Instruction Overlap
(Panels: (a) Using Scoreboard; (b) With Queueing for Resources.
Axes: instruction versus time in clocks.)


The IP, FP, DM and CNB resources are modeled, and utilization of
these  resources  is  reported.  Program  memory  (PM)  is  modeled  as 
two  separate  resources  when  the  memory  is  homogeneous,  but  is 
shown  to  be  utilized  only  during  the  memory  cycle  time  actually 
used  for  access.  That  is,  neither  cycles  aborted  by  data  access 
to  that  memory,  nor  the  time  spent  holding  output  while  waiting 
for  space  in  the  program  stack  are  counted  as  program  memory 
utilization  time. 

A running  count  (S)  is  maintained  of  the  number  of  instructions 
executing,  this  count  being  updated  whenever  an  instruction  starts 
or  ends.  Special  resources  (S)  are  used  causing  reports  of  the 
percentage  of  the  simulated  execution  time  that  1,  2,  or  3 
instructions  are  executing  concurrently. 

7.1.4  Synchronizing  Action 

The state of synchronization of the processor is described by the
state of two flipflops (M): I GOT HERE (IGH), which is set by
certain instructions, and ENabled (EN), which is set whenever the
processor is enabled, and when reset causes the processor to stop
executing before the next instruction. A logic level formed by
the logic combination (IGH or EN), ANDed with the same logic level
from all other processors, is transmitted to the coordinator (CR).
The IGH is reset when a GO pulse is received from the CR, and EN is
set by an ENABLE pulse from CR (M). Cable delays for these
signals are zero from CR to EU, and three clocks from EU to CR,
because the system clock is assumed to be in CR, so that signals
from CR travel with, and arrive at the same time as, the
corresponding clock.

The uses of synchronizing action are as follows: Certain
instructions (FETCHEM, BDCST, HVST, SHIPCN) involve exchange of
data through the CN, under overall control of the CN by CR, with
clocked, synchronized data transmission. Therefore, all processors
must be at the proper point in the instruction (or disabled)
before the data transmission can begin. Such instructions set IGH
during execution and then wait for GO from CR before completing in
synchronism across the array. Since all such instructions use the
CN Buffer (CNB) unit, and the data exchange is via the data buffer
internal to CNB, the processor can continue executing succeeding
instructions as soon as IGH is set, provided that none of them use
CNB or set IGH. Certain other instructions (WAIT) set IGH and
then wait for GO before starting the next instruction. Obviously,
such an instruction cannot start if IGH is already set, but must
wait for the GO to reset IGH before going on. The STOP
instruction resets EN and then stops instruction execution until
the ENABLE from CR again sets EN.
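The barrier behavior described in this section can be sketched as below. The sketch assumes that a processor contributes a true level when it has set IGH or when it is disabled (EN reset), which follows the "(or disabled)" remark above; the polarity of the EN term and all names are assumptions, and cable delays are ignored.

```python
from dataclasses import dataclass

@dataclass
class Processor:
    igh: bool = False   # I GOT HERE flipflop
    en: bool = True     # ENabled flipflop

def array_ready(processors):
    """Level seen by CR: every processor has arrived or is disabled."""
    return all(p.igh or not p.en for p in processors)

def go(processors):
    """GO pulse from CR: reset IGH in every processor."""
    for p in processors:
        p.igh = False

# One processor has reached the sync point, one is still executing,
# and one has executed STOP (EN reset).
array = [Processor(igh=True), Processor(), Processor(en=False)]
print(array_ready(array))   # the second processor has not arrived yet
array[1].igh = True
print(array_ready(array))   # CR may now issue GO
go(array)
```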


7.1.5 External Access Model


The CN buffer (CNB) unit contains registers for the Extended
Memory address, which are loaded from Integer Registers, and a
Data Buffer to hold data which is to be transmitted through the CN
or which is received from it. The data buffer in CNB may be empty
or full, and in either case may be busy or available. However, in
our model, the buffer is always available when CNB is available,
and is then either FULL or EMPTY depending on the last CNB
instruction executed. The CNB functions are designed so that those
which transmit data through CN (STOREM, HVST, SHIPTN) specify the
processor source for the data (DM, FP register, one or more IP
registers), and so appear in several versions. However, those
instructions which receive data from CN (LOADEM, FETCHEM, BDCST,
SHIPTN) do not specify a processor destination, but leave the data
in the FULL data buffer in CNB, from which it is transferred by a
second instruction (-REM) which specifies the processor
destination. The reason for this is that there may be appreciable
delay in using the CN, either by conflicts in CN or at EM for
LOADEM, or by waiting for other processors in the synchronized CNB
instructions. In either case, the separation of the CNB action and
the transfer to processor destination allows the compiler to save
execution time and mask the delay time by inserting the CNB
instruction as early as possible, followed by as many other
instructions as possible before calling for the data.
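The benefit of separating the CNB instruction from the -REM transfer can be illustrated with a one-line model. This is a simplified sketch, not part of the simulator: it assumes the independent instructions placed between the two take a known number of clocks and that nothing else contends for CNB; the function name is hypothetical.

```python
def stall_clocks(cn_delay, independent_clocks):
    """Clocks the -REM transfer waits for data to arrive in the CNB buffer.

    cn_delay: clocks from the CNB instruction until the buffer is FULL.
    independent_clocks: clocks of other work scheduled in between.
    """
    return max(0, cn_delay - independent_clocks)

for filler in (0, 3, 8):
    print(filler, stall_clocks(5, filler))
```

Once the filler work covers the whole CN delay, the delay is fully masked and adding more independent instructions gains nothing further.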

A CNB instruction (or -REM, which also requires CNB to be
available) cannot start while CNB is still busy with a prior CNB
instruction. In our model, succeeding instructions, therefore,
must  also  be  delayed,  although  in  a queueing  model,  having  a 
register  for  queueing  CNB  instructions,  succeeding  instructions 
not  requiring  any  of  the  same  resources  might  continue  execution. 

7.1.6 Branching

We model only the execution time of instructions, and the model
does not "know" what they do, nor do the modeled instructions
contain any data or addresses. Therefore, branching must be
controlled by special code words in the simulated code file which
do not model any actual instructions but contain coded branch
control information to be used by the simulator. Such words can be
inserted anywhere, and their processing does not take any time or
resources. However, every time a branch is executed, the
simulated code addresses are reported, in order to allow tracing
the execution of the first simulated code. The branch controls
operate as follows: The first time a branch control is executed, a
processor subroutine is initiated to process it. The code word
contains algorithms to specify:

(a) The branch address, which may be dependent on processor
    number if desired,

(b) The repeat number, N, which may be dependent on processor
    number, and which may be calculated from a probabilistic
    algorithm if desired,

(c) The kind of branch action desired. Program control may
    either drop through N times and then branch or may
    branch N times and then drop through. In either case,
    when the branch control has been executed N+1 times, the
    simulator subroutine terminates, and the next time the
    control word is executed the process is reinitiated.

(d) Special CALL/RETURN constructs are available for
    subroutine calls.
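The two kinds of branch action in item (c) can be sketched as a generator that reports, for each of the N+1 executions of a control word, whether the branch is taken. The function name and the boolean flag are illustrative assumptions; the dependence of N on processor number and the CALL/RETURN constructs are not modeled.

```python
def branch_control(n, branch_first):
    """Yield True (branch taken) or False (drop through) for N+1 executions.

    branch_first=True : branch N times, then drop through (loop-closing branch).
    branch_first=False: drop through N times, then branch (loop-exit branch).
    """
    for k in range(n + 1):
        if branch_first:
            yield k < n
        else:
            yield k == n

print(list(branch_control(3, True)))
print(list(branch_control(3, False)))
```

After the generator is exhausted, the next execution of the control word would start a fresh sequence, matching the reinitiation described in item (c).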

7.1.7  Coordinator 


The coordinator (CR) has its own instruction set and simulated
code file. However, the CR is modeled in less detail than the EU,
so the simulated instruction description requires only one coded
word containing six parameters: three kinds of synchronizing
actions, size of instruction, CR memory action, and CR processor
time. The processor is modeled as a single unit, so no
instruction overlap is modeled, except for memory access portions.
The CR memory is modeled as a single module used for both program
and data. The interaction is simplified by assuming that program
fetch is initiated only when there is space for the new word in
program stack (two words capacity), and once initiated a program
fetch is not interrupted by data access. There is usually little
contention for memory, since data access in CR tends to be
infrequent.

Branch control in CR is implemented in essentially the same way as
in the EU, except, of course, there can be no dependence on
processor number.

7.1.8  Extended  Memory  Access 

The new Connection Network is used asynchronously, which has
important advantages over the synchronous TN when the pattern of
EM access is different in different EU's because of data dependent
branching, or when the pattern is not a P-ordered vector. However,
internal conflicts in the CN, or multiple requests to the
same EM module, can cause some accesses to be delayed. The delays
are determined by the actual pattern and timing of accesses across
the entire array. However, within the framework of this simulator
it is impossible to model these delays exactly because:

(1) We model only a few processors (usually one), not 512.

(2) We model only the execution time, not the actual data
    processed, so the access patterns are not modeled.

(3) The simulation model for the CN is complex, so that it is
    impractical to incorporate it into the FMP model.


Therefore, the CN delay is modeled as a probability distribution.
The nominal distribution is exponential, with an expected value of
five clocks. Separate simulations (see Appendix B) of random accesses
to Extended Memory through the Connection Network under maximum
possible load conditions indicate an average access delay of about
one CN clock (three processor clocks). Since the CN simulator has
not been run long enough in any test case to reach steady state,
we assume that the distribution of delays may have a longer tail
than the exponential, so we approximate this worst case by an
exponential distribution with an expected value of four clocks (and
a standard deviation of four clocks). The standard deviation of the
average delay for 100 accesses is therefore 0.4 clocks (four clocks
divided by the square root of 100), so that the worst
case out of 512 processors would be expected to have an average
delay 2.9 standard deviations greater, or 5.2 clocks.
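
The arithmetic behind the 5.2-clock figure can be checked with a short sketch. It assumes that the average of 100 independent delays is approximately normal, and it reads "worst case out of 512" as the (1 - 1/512) quantile of that normal distribution. Both readings are assumptions made for illustration, and the variable names are not taken from the simulator.

```python
from math import sqrt
from statistics import NormalDist

MEAN_DELAY = 4.0    # clocks, expected value of the assumed exponential
STD_DELAY = 4.0     # clocks, an exponential has std equal to its mean
ACCESSES = 100      # accesses averaged per processor
PROCESSORS = 512    # processors in the array

# Standard deviation of the mean of 100 independent delays.
std_of_average = STD_DELAY / sqrt(ACCESSES)

# Assumption: the worst processor sits near the (1 - 1/512) normal quantile.
z_worst = NormalDist().inv_cdf(1.0 - 1.0 / PROCESSORS)

worst_average = MEAN_DELAY + z_worst * std_of_average

print(f"std of average delay: {std_of_average:.2f} clocks")
print(f"worst-case z score:   {z_worst:.2f}")
print(f"worst average delay:  {worst_average:.2f} clocks")
```

Under these assumptions the quantile comes out close to the 2.9 standard deviations quoted above, and the worst average delay rounds to the 5.2 clocks used as the nominal value.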

Extended-memory-access  instructions  are  thus  modeled  with  the  CN 
delay  added  to  normal  execution  time  of  the  instruction.  The 
decoding  of  the  next  instruction  and  its  overlapping  execution  (if 
possible)  is  not  delayed.  The  execution  of  Processor  code  is 
delayed  by  instructions  which  use  the  CN  buffer  only  if  the  CNB  is 
still  busy  with  the  last  such  instruction,  since  the  EM  accesses 
are  managed  by  CNB  without  interfering  with  the  use  of  any  other 
processor resources. Delay is minimized by ordering the code so
as to interpose other instructions between CNB uses. In particular,
the -REM type instruction, which uses the data fetched to
the Data Buffer in CNB by an EM access, is placed as late as
possible in the instruction stream. By these means, the code (FX
subroutine) which suffered the largest delay from waiting for CNB
was delayed only 11.4% (see Sec. 7.2.6.2 and Table 7.2). In this
case, reducing the contention delay discussed above from five
clocks to 0.5 clock increased the throughput only 4.0%.
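
The effect of code ordering on CNB waiting can be illustrated with a toy model. This is only a sketch of the rule stated above (an instruction that uses the CNB stalls only while the buffer is still busy with the previous one); the function `run`, the instruction tuples, and all clock counts are hypothetical and are not taken from the simulator.

```python
def run(stream, cnb_busy_clocks):
    """Return (elapsed_clocks, stall_clocks) for a list of (clocks, uses_cnb)."""
    now = 0           # processor clock
    cnb_free_at = 0   # clock at which the CN buffer becomes free again
    stalled = 0
    for clocks, uses_cnb in stream:
        if uses_cnb:
            if now < cnb_free_at:
                # Buffer still busy with the previous EM access: wait for it.
                stalled += cnb_free_at - now
                now = cnb_free_at
            cnb_free_at = now + cnb_busy_clocks
        now += clocks   # processor time used to issue or execute the instruction
    return now, stalled

LOAD = (1, True)    # hypothetical: 1 clock to issue an EM access
WORK = (8, False)   # hypothetical: 8 clocks of unrelated arithmetic

back_to_back = [LOAD, LOAD, WORK, WORK]
interleaved = [LOAD, WORK, LOAD, WORK]

print(run(back_to_back, cnb_busy_clocks=8))
print(run(interleaved, cnb_busy_clocks=8))
```

With the buffer busy for 8 clocks per access, the back-to-back ordering stalls for 7 clocks while the interleaved ordering does not stall at all. The model ignores the time for the final access to complete, which a fuller model would charge to the instruction that consumes the fetched data.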

7.1.9  Simulation  Results 

The  primary  information  provided  by  the  simulation  run  is  the 
elapsed  time  required  to  run  the  simulated  code,  and  the  number  of 
floating  arithmetic  operations  performed,  which  together  give  the 
throughput in floating operations per second.

Much additional information is reported, such as the total execution
time of the arithmetic part of floating point instructions,
delays caused by branching and program fetch, utilization of the
various processor and coordinator resources, and the extent of
instruction overlap achieved in the processor(s). This information
can be useful in understanding why the processing of the
code behaved as it did, and in guiding the details of
hardware and software design.
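
As a sketch of how elapsed time and operation count combine into a throughput figure, the function below scales one processor's result to the whole array. The 40-ns clock and the 512 processors are taken from this chapter; scaling a single simulated processor by 512 is an assumption made here for illustration, and the name `mflops` is hypothetical. The example uses the SQRT figures reported in Section 7.2.6.3 (14 flops in 119 clocks).

```python
CLOCK_NS = 40       # clock period assumed in Section 7.2
PROCESSORS = 512    # processors in the array

def mflops(flops_per_processor, elapsed_clocks,
           processors=PROCESSORS, clock_ns=CLOCK_NS):
    """Array throughput in MFLOPS from one processor's flop count and run time."""
    elapsed_seconds = elapsed_clocks * clock_ns * 1e-9
    return processors * flops_per_processor / elapsed_seconds / 1e6

sqrt_rate = mflops(14, 119)
print(round(sqrt_rate))
```

The result is about 1506 MFLOPS, which agrees with the rounded figure of 1500 MFLOPS quoted for SQRT.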

7.2  SIMULATIONS PERFORMED

The code segments simulated were selected from the Hung-MacCormack
explicit and the 3-D implicit aero flow codes, and from
a GISS weather code. The criteria for code selection were, first,
to select a range of types of codes covering a wide range of flop
throughput and of factors influencing the throughput, and second,
to include from each code samples typical of those portions which
account for the major part of the execution time of the
program.

By comparing each block of code in an entire program with the code
segments actually simulated, it was then possible to estimate
throughput for that block, and by proper weighting, to estimate an
average throughput for the entire run of each program. These
estimates are probably on the low side because the parameters used
in the simulation model are conservative:

(a)  The  assumed  40-ns  clock  period  is  ample  for  ECL  logic. 
This  allows  safe,  conservative  logic  design,  and  actual 
detailed  design  may  show  that  a slightly  faster  clock  is 
feasible. 

(b)  The  execution  time  for  arithmetic  instructions  is 
assumed  constant.  If  the  instruction  logic  is  designed 
to give data-dependent execution time, the assumed
constant  value  is  near  the  worst  case,  and  the  average 
execution  time  will  be  considerably  less. 


(c)  No great sophistication of the compiler is assumed,
either in the generation of efficient code or in
optimization of register allocation or instruction
reordering.

(d)  The  scoreboard  method  for  controlling  the  overlap  of 
instruction  execution  is  assumed.  As  discussed  in 
Section  7.1,  this  is  less  effective  than  the  queueing 
method  which  would  probably  be  used. 


(e)  The use of the new Connection Network for access to
Extended Memory will produce some delays caused by
contention  within  the  CN  or  at  EM.  These  delays  are 
difficult  to  estimate  accurately,  so  a conservative 
estimate  was  used.  The  actual  delays  in  real  program 
runs  will  probably  be  considerably  less  than  the 
simulated  value. 
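
The block-by-block weighting described before this list can be sketched as follows. The sketch assumes that each block contributes its flop count and its running time, so that the program average is total flops divided by total time; the function name and the flop shares in the example are hypothetical, and only the three rates are simulated figures quoted in Section 7.2.

```python
def program_mflops(blocks):
    """blocks: list of (flops, mflops_rate) pairs, one pair per code block."""
    total_flops = sum(flops for flops, _ in blocks)
    # Time for each block is flops / rate, so slow blocks weigh heavily.
    total_time = sum(flops / rate for flops, rate in blocks)
    return total_flops / total_time

# Hypothetical flop shares (60%, 30%, 10%) at three simulated rates.
blocks = [(60.0, 1330.0), (30.0, 835.0), (10.0, 70.0)]
average = program_mflops(blocks)
print(round(average))
```

For these made-up shares the average is about 447 MFLOPS, well below the flop-weighted arithmetic mean of the three rates, because a block that runs slowly dominates the elapsed time even when it holds few of the flops.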


7.2.1  Selected  Codes 

The selected code segments were TURBDA, AMATRX, and a portion of
BTRI from the implicit code, LX and a portion of CHARAC from the
explicit code, and parts of AVRX, COMP2, and COMP3 from the GISS
weather code. The throughput indicated by simulation of these
code samples ranges from 70 MFLOPS for AVRX to 1330 MFLOPS for
AMATRX. The simulation results are summarized in Table 7.1, which
includes additional information on utilization of processor resources
and delays in execution as reported by the simulator. A
detailed discussion of this table, and of the throughput-controlling
features of the several codes, follows. The FMP FORTRAN and
assembly-language versions of the codes as simulated are given in
Appendix G.

Some of the earlier simulations were performed with a model of the
Transposition Network (TN), in which accesses to EM are synchronous
across the array and controlled by the CU. Comparison of the CN
and TN performance indicates that the average contention delay
involved in using the CN is compensated by the fact that EM access
instructions through the CN can be more completely overlapped because
of the buffering action of the associated CN Buffer unit in the
processor, and by the use of a much faster implementation of the
IMOD521 instruction in the later version of the model. The earlier
simulation results therefore remain essentially valid.


7.2.2  TURBDA 

This  code  has  four  EM  accesses  (three  LOADEM’s  and  one  STOREM)  in 
the  inner  loop,  and  28  floating  arithmetic  operations,  of  which  18 
are  concentrated  in  an  in-line  Newton-Raphson  square  root.  The 
somewhat low throughput of 835 MFLOPS is accounted for by three
factors:

(a)  The average execution time of 8.1 clocks per floating
arithmetic operation is about 10 percent longer than
average because of a somewhat higher than average proportion
of divides and multiplies.

(b)  The integer operations form a higher than average proportion,
as shown by the IP use percentage, and are not
overlapped as much as usual by floating operations.

(c)  There  are  only  seven  floating  operations  per  EM  access. 


7.2.3  AMATRX 

This section of the implicit code generates the
local five by five matrices to be inverted by BTRI. Each
iteration of the inner loop performs 80 floating point operations
for five EM accesses (LOADEMs) and 53 local memory accesses.
This is a rather favorable case, as shown by the high (84 percent)
utilization of the FP unit in the processor. The fact that the EM
and local memory accesses are performed with no more than about 20
percent loss from the maximum theoretical throughput of 1680
MFLOPS for the per-FLOP time of 7.6 clocks indicates the
effectiveness of the instruction overlap.
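
The theoretical maximum quoted above can be reproduced from the per-FLOP time. The sketch assumes the 40-ns clock and the 512 processors used elsewhere in this chapter, and assumes that "maximum theoretical throughput" means every processor doing floating arithmetic continuously; the function name is hypothetical.

```python
CLOCK_NS = 40       # clock period assumed in Section 7.2
PROCESSORS = 512    # processors in the array

def peak_mflops(clocks_per_flop, processors=PROCESSORS, clock_ns=CLOCK_NS):
    """Upper bound: every processor does floating arithmetic all the time."""
    return processors / (clocks_per_flop * clock_ns * 1e-9) / 1e6

amatrx_peak = peak_mflops(7.6)
amatrx_loss = 1.0 - 1330.0 / amatrx_peak   # 1330 MFLOPS is the simulated rate

print(round(amatrx_peak), f"{amatrx_loss:.0%}")
```

Under these assumptions the bound is about 1684 MFLOPS, and the simulated 1330 MFLOPS falls about 21 percent short of it, consistent with the "about 20 percent loss" stated above.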

7.2.4  BTRI 

A representative  portion  of  the  BTRI  subroutine  was  hand  compiled 
for  this  simulation.  About  77  percent  of  the  floating  operations 
are  concentrated  in  an  inner  loop,  which  is  a 5 by  5 nested  DO 
loop  with  only  10  floating  operations  and  7 indexed  local  memory 
fetches in the inner loop. BTRI runs slower than might be expected
for a subroutine with no EM accessing because:

(a)  The indexing of the local arrays causes a large number
of integer operations which cannot be entirely
overlapped by the few FLOPS.

(b)  Several of the integer operations are large (48-bit)
instructions, but with short execution time, so that
they use up code faster than it can be fetched. The
result is the indicated 14.2 percent of elapsed time
spent waiting for program fetch.

(c)  The nested DO loop causes a large number of branches,
causing 3.3 percent of time to be spent waiting for
program fetch after branch, and further aggravating (b)
above because every branch causes any program look-ahead
which has been done to be wasted.

Note that some of these inefficiencies could be reduced by unwinding
the inner DO loop, which would involve repeating the same
brief code section five times. Some of the cost of indexing the
local arrays could be saved by reprogramming to store the N different
five by five matrices which are generated by AMATRX in a 25 by
N array instead of a five by five by N array. This, together with
the unwinding of the inner loop, would considerably increase the
throughput of BTRI. The loss of program look-ahead on branching
can be reduced by implementing fetch on no-branch instead of fetch
on branch.
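
The saving from the 25 by N layout can be seen from the address arithmetic. The sketch below assumes FORTRAN column-major storage with 1-based subscripts; the function names are hypothetical. It shows that the two layouts address the same storage, while the two-subscript form needs one fewer multiply and add per reference, and that with the inner loop unwound the first subscript becomes a constant.

```python
def offset_3d(i, j, n):
    """Column-major offset of A(i, j, n) in a 5 x 5 x N array (1-based)."""
    return (i - 1) + 5 * (j - 1) + 25 * (n - 1)

def offset_2d(k, n):
    """Column-major offset of B(k, n) in a 25 x N array (1-based)."""
    return (k - 1) + 25 * (n - 1)

# Every element A(i, j, n) is the same word as B(i + 5*(j - 1), n).
same_layout = all(
    offset_3d(i, j, n) == offset_2d(i + 5 * (j - 1), n)
    for n in range(1, 4)
    for j in range(1, 6)
    for i in range(1, 6)
)
print(same_layout)
```

Because the mapping is one to one, the data generated by AMATRX would not need to be rearranged; only the declarations and subscripts would change.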


7.2.5  GISS Climate Code Samples


Analysis of the weather/climate code throughput is discussed in
Section  3.4.4.  The  samples  selected  for  simulation  were  as 
follows. 


7.2.5.1  AVRX

This  routine  smooths  the  data  in  the  longitude  direction  for 
latitudes  near  the  poles  in  order  to  compensate  for  the  too-close 
spacing  of  these  grid  points.  The  number  of  iterations  of  the 
smoothing  algorithm  therefore  depends  on  latitude  and  must  be 
computed. The index computations are fairly complex, and the
smoothing algorithm itself is very simple, so there is less than
one floating arithmetic operation per EM access, and much integer
computation.

Furthermore,  the  organization  into  a DOALL  in  which  26  instances 
are  allocated  to  each  processor  in  order  to  leave  no  processors 
idle  considerably  increases  the  integer  computation  to  be  executed 
in  each  instance,  thus  partially  defeating  the  purpose. 

The net result is a code sample which is in a sense a worst case
for the FMP, with floating point throughput of only 70 MFLOPS.

7.2.5.2  COMP2

Portions of the CORIOLIS FORCE and VERTICAL ADVECTION code were
hand compiled and simulated. This code performs only about two
floating point operations per Extended Memory access, and the
addresses in two- or three-dimensional EM arrays are calculated
from the indices with no shortcuts, so that one or two
double-precision integer multiplies are required for each calculation.

The  result  is  that  integer  arithmetic  dominates  the  code,  as  shown 
by  the  COMP2  entry  in  Table  7.1,  where  the  integer  processor  is 
busy  65  percent  vs  only  40  percent  for  the  Floating  Processor. 
The  floating  arithmetic  also  is  above  average  in  clocks  per 
operation  (10.3),  and  floating  arithmetic  is  being  executed  only 
30  percent  of  the  elapsed  time. 

7.2.5.3  COMP3

The portion of COMP3 simulated was LINKHO, which has no EM accesses.
The throughput of 980 MFLOPS indicated by simulation is only 25 to
30 percent less than the practical maximum of about 1300 to 1400
MFLOPS attained when the processors are doing floating arithmetic
80 percent of the time. The COMP3 result is lower because of
three factors:

(a)  The average floating arithmetic operation takes 9.1
clocks, compared with the nominal average of 7.3,
because of a higher proportion of multiplies and
divides.

(b)  The FP arithmetic is being done only 70 percent of the
time, although the FP unit is busy 82 percent. This is
because of a number of non-arithmetic floating register
operations such as change sign, move, and local memory
access.

(c)  Program branching causes delays amounting to nearly four
percent of the time.

[Table 7.1  FMP Simulation Results (significant factors). Table body
not reproduced. Note 1: See Appendix A for discussion of AVRX.]

7.2.6  3-D Explicit Aero Flow Code

In the FMP model used for the simulations shown in Table 7.2 and
reported below, the local memory is homogeneous: both modules are
shared between data and program. One test run with the processor
model  having  separate  program  and  data  memories  shows  about  5 
percent  lower  throughput  because  of  waiting  for  program  fetch. 
Circles  are  used  in  Table  7.2  to  call  attention  to  those  items 
which  limited  throughput  for  each  simulation. 

7.2.6.1  CHARAC

This third level subroutine from the explicit code has no EM
accesses, but contains many data-dependent branches, and DO loops
whose iteration count varies because of data-dependent exits. The
CHARAC code would therefore be very difficult to vectorize, but
presents no difficulty to the parallel machine, although of course
the tests cost time.

The section of CHARAC code (shown in Appendix H) which was simulated
consists of a DO loop on JC and a portion of the code following
the JC loop. The JC loop is preceded by several local memory
accesses to save local registers, and within the loop are 24 floating
point arithmetic operations. It is exited by the AND of two
comparisons of floating variables. The JC loop contains an inner
DO loop on JJ, which performs only integer operations, and is
exited by the AND of two comparisons of floating point variables.

Three  simulations  were  performed,  varying  the  JC  and  JJ  counts,  as 
shown  in  Table  7.2: 

(a)  JC loop performed eight times, with JJ performed six
times in each. This gives the low throughput of 900
MFLOPS.

(b)  JC loop performed 15 times, with JJ performed once in
each, giving throughput of 1180 MFLOPS.

(c)  Same as (a), but with JJ loop reprogrammed in FORTRAN to
use only one rather than two comparisons of floating
variables to decide the exit from the loop, and with the
new JJ loop coded for maximum efficiency by hand, using
tricks a compiler might be smart enough to use. The
throughput of this version is 990 MFLOPS, or 10% more
than version (a), because of the 40% reduction in running
time of the recoded JJ loop. If the JJ loop is
performed fewer times, the throughput will approach or
slightly exceed case (b).


[Table 7.2  Summary of Simulations of Explicit Code. Table body not
reproduced.]

Nearly half of the JJ loop time is accounted for by waiting for
program fetch, both after a branch, and when code is being executed
faster than it can be fetched. This is another case where
more efficient packing of code or faster access to program memory
would appreciably improve throughput.

It is interesting to note that in some algorithms the programmer
can use arithmetic comparisons and conditional branching to save
some arithmetic. In such cases the throughput measure of programs
would be more consistent if floating point comparisons were considered
to be arithmetic operations; otherwise a program with superior
performance might be measured as having lower throughput. If
this measure were applied to the three cases of CHARAC discussed
above, they would become nearly equal at about 1300 MFLOPS.

Examination of Table 7.2 shows the following important factors
affecting throughput of the CHARAC sample.

(a)  The average execution time of a floating arithmetic
operation is 7 to 8 percent higher than the nominal 7.4
clocks. This is because of a higher than average proportion
of divides.

(b)  Waiting for program fetch, both after branching and in
other places, accounts for 13 to 15 percent of the elapsed
time in cases (a) and (c). The high utilization of
Program Memory is not responsible: most of both delays
occur in the JJ loop, which uses a good deal of program
space while requiring little execution.

(c)  The utilization of the Floating Processor is low in
cases (a) and (c), even allowing for program fetch
delays, indicating a good deal of non-overlapped integer
computation. Again, this is mostly in the JJ loop, as
indicated by case (b) where the JJ range of code is
executed only once, and the FP utilization is only about
12% below the values attained in AMATRX and COMP3.

7.2.6.2  LX/FX

The second level subroutine LX from the explicit aero-flow code
executes within a DOALL with J and K as domain variables, and each
instance has inner DO loops on I, with IL or IL-2 iterations. The
third level subroutine FX is called in an inner loop that is performed
twice, so FX is called about twice IL times in each instance
of JK. The simulations were run with the IL and IL-2 loops
both performed 10 times, since the computer runs would have been
too long with values of 100 and 98. At ten iterations the code in
the loops dominates the running time, so there is little error in
this approximation. Separate simulations of LX, with FX calls
deleted, and of FX code (with no RETURN) were performed. For
interest's sake, the SQRT code which is present in-line in FX was
also timed by using the trace in the simulation output.


The results are shown in Table 7.2. The FX calls contribute about
70 percent of the FLOPS in LX/FX, and the tabulated figures for
LX/FX agree with the weighted average between LX and FX figures.
The LX/FX throughput of 570 MFLOPS is limited by the factors
circled in the table: (1) a mix of arithmetic instructions that
gives an average execution time of 8.3 clocks, or about 12% more
than average, (2) a high usage (48%) of the integer processor,
(3) appreciable delay (4.5%) for program fetch after branch, and
(4) 9.2% of the running time spent waiting for extended memory
access.

LX is structured in such a way that much of the EM data for the IL
loops is pre-fetched to local arrays in a DO loop of IL iterations
which performs no floating point arithmetic, and similarly, at the
end the local array of results is written back to EM. These
pre-fetch and post-store portions of LX take 12% of its time (not
counting FX). Similarly FX was coded to precalculate and save
indices and local variables used repeatedly in the code, and this,
together with save and restore of registers used in FX, takes 13%
of FX time. The rest of LX and FX appear to be normal code, with
no more than average amounts of pure integer operations and looping
and branching, so that the results should be considered normal
for the flops-per-EM-fetch ratio of these codes.

It is clear from the LX/FX simulations that at their rate of EM
accesses about half the execution time is spent doing the EM
accesses and the integer computations of EM addresses. As an
experiment, a simulation was run with the average delay caused by
contention in CN and EM reduced to 1/2 clock, as would be expected
for the actual average loading of CN (12%). This reduced the
running time of FX by 4.0%. In a second experiment, the execution
times of double precision integer arithmetic were reduced to
the values estimated for single precision in a 32-bit integer
arithmetic unit. This produced a further 6% reduction, for a
total of 10% reduction with both changes, or an FX throughput of
650 MFLOPS.

7.2.6.3  SQRT

A new square root macro was programmed, using recently added
integer/floating transformation instructions (FIX, FLOAT, ADDEX,
MOVEX) to address a local memory table for a first approximation
good enough that only three iterations are necessary. The
resulting SQRT has 14 flops, runs in 119 clocks, and has a throughput
of 1500 MFLOPS for the array. The entry in Table 7.2 is
incomplete because SQRT was not simulated by itself, the values
shown being extracted from the trace of the FX simulation. This
SQRT accounts for 10% of the flops of LX/FX. The SQRT found in
TURBDA was an earlier version.
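
The idea of seeding a Newton-Raphson square root from a small table can be sketched as follows. This is an illustration of the general technique only, not the macro described above: the table size, the use of `math.frexp` and `math.ldexp` in place of the FIX, FLOAT, ADDEX and MOVEX instructions, and the name `table_sqrt` are all assumptions, and the sketch makes no attempt to match the 14-flop count.

```python
import math

TABLE_BITS = 4                 # hypothetical: 16-entry table
SLICES = 1 << TABLE_BITS
# Seed for each slice of the mantissa range [0.5, 1): sqrt of the slice midpoint.
SEED = [math.sqrt(0.5 + (k + 0.5) / (2 * SLICES)) for k in range(SLICES)]

def table_sqrt(x, iterations=3):
    """Square root by table lookup plus a fixed number of Newton steps."""
    if x < 0.0:
        raise ValueError("square root of a negative number")
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x == m * 2**e with 0.5 <= m < 1
    k = int((m - 0.5) * 2 * SLICES)    # index from the leading mantissa bits
    odd = e % 2                        # an odd exponent contributes sqrt(2)
    y = SEED[k] * (math.sqrt(2.0) if odd else 1.0)
    y = math.ldexp(y, (e - odd) // 2)  # halve the (now even) exponent
    for _ in range(iterations):
        y = 0.5 * (y + x / y)          # Newton-Raphson step for y*y = x
    return y

print(table_sqrt(2.0), math.sqrt(2.0))
```

With a 16-entry table the seed is within about 2 percent of the true root, and since each Newton step roughly squares the relative error, three steps bring it close to double-precision rounding error in this sketch.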


7.3  APPLICATION OF SIMULATOR RESULTS


The above simulation results form the basis for the application
analysis summarized in Chapter 3 and described in more detail in
Appendix A. The extension of the simulator measurements to those
code sequences that were not simulated is also described in those
locations.

Chapter  8

SCHEDULE  AND  FACILITIES 


8.1  SCHEDULE 

8.1.1  Introduction 

Realistic  scheduling  of  a large  program  such  as  NASF  requires  the 
systematic  definition  of  the  tasks  to  be  performed  to  levels  for 
which  reasonable  estimates  can  be  made.  With  each  successive 
level  of  detail  the  time  estimates  become  more  accurate.  For  the 
purposes  of  this  study  only  the  first  level  has  been  estimated  for 
the  total  effort.  It  therefore  must  be  considered  tentative. 
Second and third level schedules have been prepared in specific
task areas to demonstrate the refinements that ultimately must be
prepared for the total effort and to illustrate the management
tools that can be used to monitor, analyze, and control the
program schedule.

8.1.2  The  Overall  NASF  Program  Schedule 

The NASF program schedule presented in Figure 8.1 is based on a
number of factors and assumptions. It is assumed that the initial
sixteen months are dedicated to the design and final specification
effort. After this initial effort, final design leading to
procurement, tooling and manufacturing will begin. Most of these
implementation tasks are of the order of fifteen to twenty-one
months. The final period of integration, delivery, installation
and testing is estimated at eighteen months. This results in a
total program duration of 55 months (16 + 21 + 18). The estimates
are based on past experience and best judgement. They do not
represent either the best or worst case possibilities.

No attempt has been made to define a critical path for this
summary schedule. However, critical paths have been determined on
schedules of individual activities, as will be demonstrated in the
examples that follow. The final output of the overall program
schedule is shown as the "deliverables".

The  NASF  schedule  has  been  divided  into  nine  task  areas. 

1.  Program  Management 

2.  Systems  Management,  Integration  and  Test 

3.  Flow  Model  Processor 

4.  File  Memory  Subsystem 

5.  Support  Processor  Subsystem 

6.  System  Software 

7.  User  Support  Subsystem 

8.  Facility  Engineering 

9.  System Support

The  above  breakdown  is  based  on  grouping  of  tasks  of  similar 
nature  or  relating  to  a major  deliverable  element.  This  same 
breakdown could be used for cost estimating as well.

For the most part, the scope of these areas is self-evident.
Program  Management  includes  the  monitoring,  review,  reporting  and 
control  of  the  overall  program  activities.  In  addition,  this  task 
includes  schedule,  cost  and  configuration  control,  generation  of 
procurement  and  production  releases,  subcontract  performance 
monitoring  and  liaison  with  customer  representatives.  The  last 
area,  System  Support,  covers  the  tasks  relating  to  reliability, 
maintainability,  human  factors,  spares,  documentation  and  manuals, 
and  training.  Intermediate  milestones  shown  in  these  two  task 
areas  are  not  major  events  but  represent  the  bounds  of  the  time 
periods  for  certain  emphasis. 

The schedule presented assumes that most system integration and
testing is done on the manufacturer's premises. A trade-off,
depending on the availability of the structure for housing NASF,
may show that final integration and testing may be more effectively
done at NASA Ames, possibly shortening the schedule.

8.1.3  Schedule  Management 

The schedules for final design, fabrication, integration and
installation of a large system such as the NASF Processing System
should be developed on a multilevel basis. The first level should
delineate the overall program showing major milestones and
"deliverables".

Each activity for these task areas may be delineated in more
detail in a second level of scheduling. These include such activities
as the Integration Plan, or the Fabrication and Integration
of the FMP. A third level of schedule detail further delineates
the major activities within each of these task areas, such as the
design, fabrication and testing of the FMP Processor.

For most planning, schedule control and resource management this
level of detail is sufficient. However, fourth and fifth levels
are usually desirable for specific hardware, software and documentation
items to be produced or for individual personnel or group
assignments.

The first three levels are best managed by PERT (Program Evaluation
Review Technique) type schedules. In these schedules single
events (start and/or completion dates) are depicted as nodes. The
activities or tasks to be accomplished are depicted as the
interconnecting lines. The flow represents sequential and approximate
temporal relationships. Where the completion of one task
is a prerequisite before the completion or beginning of another
task that is not a natural sequence, a "dummy" activity is shown.

The result of this graphic representation of a group of activities
is a network, showing the starting event and activities,
the major milestones (events) and activities required to accomplish
a desired goal or goals, which are in turn shown as the final
event(s). The PERT network should clearly depict the interrelationships
between various tasks.

Once the time elements are assigned to the tasks of a network, the
critical path can be ascertained. The critical path represents
that sequence of activities required for completion of the end
objective that requires the longest period of time; that is to say
that a single day (or month) slip in any one of the activities in
the path will result in a day (or month) slip in the overall
schedule.
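
The critical-path calculation can be sketched in a few lines. The network below is hypothetical: it borrows the 16-, 21-, 15- and 18-month durations mentioned in Section 8.1.2 but invents the activity names and their dependencies, and the function `analyze` is an illustration rather than a description of any scheduling package. Slack is computed here as the amount an activity can slip without delaying the end date, so activities with zero slack lie on the critical path.

```python
from functools import lru_cache

def analyze(activities):
    """activities maps name -> (duration, [predecessors]); returns (total, slack)."""
    successors = {name: [] for name in activities}
    for name, (_, preds) in activities.items():
        for pred in preds:
            successors[pred].append(name)

    @lru_cache(maxsize=None)
    def earliest_finish(name):
        duration, preds = activities[name]
        return duration + max((earliest_finish(p) for p in preds), default=0)

    total = max(earliest_finish(name) for name in activities)

    @lru_cache(maxsize=None)
    def latest_finish(name):
        # An activity must finish before the latest start of every successor.
        return min((latest_finish(s) - activities[s][0] for s in successors[name]),
                   default=total)

    slack = {n: latest_finish(n) - earliest_finish(n) for n in activities}
    return total, slack

# Hypothetical four-activity network, durations in months.
network = {
    "design": (16, []),
    "fabricate_a": (21, ["design"]),
    "fabricate_b": (15, ["design"]),
    "integrate": (18, ["fabricate_a", "fabricate_b"]),
}
total, slack = analyze(network)
critical = [name for name, s in slack.items() if s == 0]
print(total, critical, slack["fabricate_b"])
```

In this made-up network the critical path is design, fabricate_a, integrate, for a total of 55 months, while fabricate_b could slip six months without affecting the end date.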

One of the many advantages of PERT is that it lends itself to
management, maintenance and analysis by data processing techniques.
This is readily accomplished by the use of Burroughs PROMIS
(Project Oriented Management Information System), which has many
useful management outputs. Activities in the critical path are
easily identified. The slack in noncritical activities is
reported. The range of acceptable start and finish dates is
provided. Holidays, overtime and shift work can be made part of
the schedule. Flags for sorting of activities by discipline,
organization, or other keys can be employed. A PROMIS data base
is easily updated, permitting rapid assessment of the impact of
changes or other new inputs. The use of this tool in initial
planning is shown in the discussion of the schedules that follows.

8.1.4  NASF SCHEDULES

Figure  8.1  illustrates  the  major  activities  of  the  nine  NASF  task 
areas  leading  to  the  achievement  of  the  final  program  goals  (also 
shown  as  deliverables).  The  interrelationships  between  some  of 
the milestones are shown with arrows. For example, the final
design and specification of the various hardware and software
elements, once complete, are all inputs to the final integration
plans and system analysis efforts; the design and final specifications
of the hardware items are needed as an input to the activity that
will issue the final facility requirements documentation. For
each activity an expected time for completion is indicated below
the line representing the activity.

Each event is given an identifying number which is used in
creating the data base for analysis and reporting. Table 8.1
delineates this numbering system and extends it to the lower levels
for certain categories.

TABLE 8.1

NASF Event Identification Numbers

Event Numbers        Task Area

000000 - 099999      Program Management
100000 - 199999      Systems Management, Integration and Test
200000 - 299999      Flow Model Processor
300000 - 399999      File Memory Subsystem
400000 - 499999      Support Processor
500000 - 599999      System Software
600000 - 699999      User Support Subsystem
700000 - 799999      Facility Engineering
800000 - 899999      System Support

210001 - 219999      FMP Processor
220001 - 229999      FMP Extended Memory
230001 - 239999      FMP Connection Network
240001 - 249999      FMP Coordinator
250001 - 259999      FMP Cabinets and Cables
260001 - 269999      FMP Power Distributor
270001 - 279999      FMP Test System
280001 - 289999      FMP Data Base Memory

To demonstrate the application of management tools for schedule
monitoring, analysis and control, the next two levels of schedule
detail for specific aspects of the NASF have been defined. The
schedule for the fabrication and integration of the FMP, which is a
major hardware item of the NASF, and the final design, fabrication
and testing of the processors, which represent a major portion of
the FMP hardware, have been selected for further delineation. It
is quite possible that the critical path for the NASF could be
dependent on activities in these two areas.

Figure 8.2 takes the single activity "Fabricate and Integrate" of
the Flow Model Processor task area and breaks it down into the
next level of detail. The first node of this schedule corresponds
to the second node of the FMP path on the program schedule in
Figure 8.1; the last node corresponds to the third node on the
program schedule. The first node on Figure 8.2 divides (with no
time allocation) into the eight major elements of the FMP.

For scheduling purposes a preferred sequence of integration is
assumed. The first point of integration is that of the FMP power
distribution system with the FMP cabinets and cables. The schedule
then calls for the integration of the coordinator with the use of
the FMP Test System (which will include the FMP diagnostic
controller). Not all of the FMP cabinets, cables and power
distribution system are required for the installation, checkout
and debugging of the coordinator. Completion of some portions of
these can be deferred until required. This level of detail can be
included on the next lower level of scheduling.

The installation of the connection network is next, followed by the
installation, integration and checkout of the processors and the
extended memory modules. The end events of the processor and
extended memory activities are shown as only two tasks for each
element: "Install First Processor" and "Install Last Processor",
and "Install First Extended Memory" and "Install Last Extended
Memory". These end events are used in lieu of having 585*
individual inputs representing each processor and extended memory
module. A rather large series of activities such as the schedules
for each of the processors is best handled by a straightforward
status list.

Figure 8.3 further delineates the detailed activities for the
processor final design, fabricate and test activity shown on
Figure 8.2. It will be noted that there are several different
paths leading to the availability of the 585 processors. The
uppermost path shows the activities for the design and procurement
of the printed circuit board. A second and third path are the
activities relating to the design and development of the processor
tester and test software. The lowest path, which merges with the
tester path, involves the design, fabrication and evaluation of a
prototype processor.

*The current estimate for the number of processor and extended
memory module manufacturing starts is 585, which takes into
account shrinkage and spares.


8.1.5  Critical  Path


The critical paths for the schedules shown in Figures 8.2 and 8.3
were determined using Burroughs PROMIS. A data base was created
listing each activity's starting and ending events, the mean time
to complete and the activity description. A hypothetical start
date was declared and a PROMIS output was generated providing the
earliest and latest start date, earliest and latest end date and
the amount of slack in each activity. A hypothetical start date
in calendar terms is required, since PROMIS uses a real calendar
for its time base. This is done to permit consideration of
weekends and holidays and for convenience of reporting. For the
purposes of this demonstration a start date of 1 July 1981 is
hypothesized for the beginning of the final design of the FMP.
Figures 8.4 and 8.5 show the PROMIS outputs for the schedules in
Figures 8.2 and 8.3. Table 8.2 explains the abbreviations used on
the PROMIS reports.

The critical path is that sequence of activities which show zero
slack. In Figure 8.4, the PROMIS output for the FMP schedule, the
critical path is seen as being:

Preceding  Succeeding
Event      Event
Number     Number      Activity Description              Mean Time

240001     252000      The coordinator design,           50 weeks
                       fabrication and test.
252000     253000      Installation and debugging of     12 weeks
                       the coordinator
253000     254000      Installation and debugging of     10 weeks
                       the connection network
254000     299000      Initial debugging of the FMP.     20 weeks

                       TOTAL                             92 weeks


There is a parallel branch in the critical path in the test system
design and fabrication. Examination indicates 42 weeks slack in
the design, fabrication and testing of the data base memory. This
shows that a starting date anywhere between 1 July 1981 and 21
April 1982 would not impact the finish date of 5 April 1983 for
that schedule element, based on the estimate of 50 weeks for its
completion. The ability to determine slack permits the manager to
effectively allocate resources among the various parallel
activities.
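
The slack figures quoted above can be reproduced with a small forward and backward pass over the activity network. The following is only a sketch, not the PROMIS algorithm: the four critical activities and their mean times come from the list above, while the wiring of the data base memory activity (from its own start event 280001 straight to the final event 299000) is an assumption made for illustration. Plain calendar arithmetic is also assumed, with no holiday handling, and the finish date is taken as the last day of the final week.

```python
from datetime import date, timedelta

# (preceding event, succeeding event, mean time in weeks)
ACTIVITIES = [
    (240001, 252000, 50),  # CR design, fabrication and test
    (252000, 253000, 12),  # install and debug CR
    (253000, 254000, 10),  # install and debug CN
    (254000, 299000, 20),  # initial debugging of the FMP
    # Assumption: the DBM activity runs from its own start event to the
    # final event; the report gives only its 50 week duration.
    (280001, 299000, 50),
]


def total_slack(activities):
    """Return ({(pred, succ): slack_weeks}, project_length_weeks)."""
    events = {e for pred, succ, _ in activities for e in (pred, succ)}
    earliest = {e: 0 for e in events}
    # Forward pass; len(events) sweeps are enough for an acyclic network.
    for _ in events:
        for pred, succ, weeks in activities:
            earliest[succ] = max(earliest[succ], earliest[pred] + weeks)
    length = max(earliest.values())
    latest = {e: length for e in events}
    # Backward pass from the project end.
    for _ in events:
        for pred, succ, weeks in activities:
            latest[pred] = min(latest[pred], latest[succ] - weeks)
    slack = {(p, s): latest[s] - earliest[p] - w for p, s, w in activities}
    return slack, length


slack, length = total_slack(ACTIVITIES)
start = date(1981, 7, 1)  # hypothetical start date used in the report
latest_dbm_start = start + timedelta(weeks=slack[(280001, 299000)])
finish = start + timedelta(weeks=length) - timedelta(days=1)
print(length, slack[(280001, 299000)], latest_dbm_start, finish)
```

Activities with zero slack form the critical path; under the stated assumptions the data base memory activity is the only one here with slack to spare.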
TABLE  8.2


PROMIS Report Terms

HEADINGS

PRED NUMBER        Preceding event number
SUCC NUMBER        Succeeding event number
DESCRIPTION        A brief identification of the activity
MEANTIME           An estimate in weeks (unless otherwise noted)
                   of the time expected for completion.
EARLIEST START     The earliest date that an activity can begin.
LATEST START       The latest date that an activity can begin
                   without impacting the schedule.
EARLIEST FINISH    The earliest date that an activity can be
                   finished.
LATEST FINISH      The latest date that an activity can be
                   finished without impacting the schedule.
TOTAL SLACK        The amount of time (in weeks unless otherwise
                   designated) in excess of the meantime during
                   which a task may be completed and not impact
                   the schedule.

ABBREVIATIONS

PR      Processor
EM      Extended Memory
CR      Coordinator
CN      Connection Network
DBM     Data Base Memory
FMP     Flow Model Processor
POW     Power
DIST    Distribution
SYS     System
PCB     Printed Circuit Board
HDWR    Hardware
MATL    Material
DES     Design
FAB     Fabricate
INST    Install
C.O.    Checkout
SPEC    Specify
MFG     Manufacturing
EVAL    Evaluate

Figure 8.5, PROMIS Report for Processor Design and Fabrication,
reveals the following critical path:

Preceding  Succeeding
Event      Event
Number     Number      Activity Description       Mean Time

210001     210005      Detail Design              8 weeks
210005     210010      Final Design               8 weeks
(210005    210035)     Power (Supply) Design      (8 weeks
                                                  parallel branch)
210010     210210      Design Tester              12 weeks
210210     210230      Design Tester Software     8 weeks
210230     210250      Develop Tester Software    12 weeks
210250     218001      Debug Tester               8 weeks
218001     218585      Begin Processor Tests      15 weeks
218585     219585      Test Last Processor        .5 weeks

                       TOTAL                      71.5 weeks

The 71.5 week period begins 1 July 1981 and ends 14 December 1982
with the completion of the off-line testing of the last (585th)
processor. It should be noted that since there are no apparent
constraints on the requirement date of the first processor, there
appears to be 18 weeks of slack. This apparent slack disappears
as soon as the requirement for the availability of the first
processor for installation into the FMP (as shown in Figure 8.2)
is considered. Accordingly, in reality, there is a branch of the
critical path after the sixth activity, Debug Tester, which is
Test First Processor, 2 weeks. This results in a critical path of
58 weeks for the availability of the first processor. This same
58 weeks is shown as the time for the first activity of the upper
path of Figure 8.2.
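
As a quick arithmetic check, the two path lengths quoted above follow directly from the mean times in the critical path list. The sketch below only sums those published durations; the list names are illustrative and do not come from PROMIS.

```python
# Mean times in weeks, taken from the critical path listed above.
COMMON_PATH = [8, 8, 12, 8, 12, 8]   # Detail Design through Debug Tester
LAST_PROCESSOR_TAIL = [15, 0.5]      # Begin Processor Tests, Test Last Processor
FIRST_PROCESSOR_TAIL = [2]           # Test First Processor

weeks_to_last = sum(COMMON_PATH) + sum(LAST_PROCESSOR_TAIL)
weeks_to_first = sum(COMMON_PATH) + sum(FIRST_PROCESSOR_TAIL)
print(weeks_to_last, weeks_to_first)
```

The parallel 8 week Power (Supply) Design branch is left out of both sums because it runs alongside Final Design and does not lengthen the path.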
FACILITIES


Refinements made during this study on the concept of the NASF as
presented in the initial study [1] have had no significant impact
on the facility requirements (also presented in the initial study).
Table 8.3 summarizes these facility requirements for power and
floor space and places a maximum limit on each. The General
Design Guidelines appendix delineates environmental factors for
the design of the NASF Processing System hardware. These same
limits should be consistent with the environmental capabilities of
the physical building.


TABLE 8.3

Summary of NASF Power and Floor Space Requirements

               ESTIMATE              MAXIMUM

POWER          555 KVA               750 KVA
FLOOR SPACE    40,000 square feet    50,000 square feet


APPENDIX A

PERFORMANCE PROJECTION BASED ON BENCHMARK PROGRAMS

A.1 INTRODUCTION

The four programs used as benchmarks in evaluating the design
were:


(1)  NASA  3D  implicit  aerodynamic  flow  (aero  flow)  code 
supplied  by  Ames 

(2)  NASA  3D  explicit  aerodynamic  flow  (aero  flow)  code 
supplied  by  Ames 

(3)  GISS  weather  code,  in  several  different  versions

(4)  Spectral  weather  code  from  MIT 

Evaluations of the first three were comprehensive, resulting in
projections of 1.01 Gflops/sec for the implicit, 0.89 Gflops/sec
for the explicit, both at one million grid points, and 0.53
Gflops/sec for the GISS weather code.

A range of throughput values from zero to 1.50 Gflops/second for
individual code sections was derived from the simulation efforts.
These variations are primarily caused by the relationships of
individual subprograms to the data in local processor memory and
extended memory, the choice of mesh size and the choice of the
metric for performance measurement. An example of zero throughput
is provided by the subroutines BCY, BCZ and OUTER in the 3D
explicit aerodynamic flow code. These routines shuffle data in
data arrays in the EM. As no floating point operations are
required for this function, a zero throughput value results. Data
sorting algorithms would be similar examples.

A throughput value of 1.45 Gflops/second is illustrated by the
intrinsic square root function. Square root operates entirely
within the processor, mostly in high speed local registers.
Substantial portions of the simulated codes run at rates from 1.1
to 1.3 Gflops/second. Examples of this are

* subroutine BTRI in the implicit code

* subroutine CHARAC in the explicit code

* subroutine LINKHO in the GISS code

Examples with lower throughput values typically occurred in
routines where a high frequency of access to the three dimensional
global arrays was required. The ability to overlap array index
calculations with floating point operations is limited under these
conditions.

Performance is generally increased when the grid size is
increased. The 3D explicit aerodynamic flow code showed 0.79 Gflops
for 30,000 grid points and 0.89 Gflops for 1,000,000 grid points.

The frequency of execution of individual code segments must be
known for the performance evaluations. Assumptions were made in
those cases where data dependent loop counts and branches occur.
Throughout the programs a mean value rule was generally employed,
with an occasional reduction to some more conservative value where
appropriate. In one case, CHARAC from the 3D explicit, simulation
was run under several different assumptions to test the
sensitivity of throughput to the data dependency assumptions. In
the case of CHARAC, throughput varied no more than 15%.

The implicit code achieves the 1.0 Gflops/sec throughput rate
being used as a guide. The explicit code appears to be about 10%
slower than the implicit code.

On GISS weather, the non-vectorizable portions of the code
exceeded one Gflops/sec (COMP3), while the vectorizable portions
(COMP1 and COMP2) were slowed down by EM accessing and memory-to-
memory moves that produced no floating point operations.


The following sections discuss the methods used for projecting
performance. Each program is then reviewed, along with some other
applications, namely sorting and fast Fourier transforms.

A.2 METHOD

The method used for performance evaluation was generally the same
for all of the first three benchmark programs. Because of time
and budget limitations, only a cursory look was taken at the
Spectral weather code.

First, throughput was analyzed on the basis of FMP computations
only. I/O operations were ignored. Transfers between DBM and
file system are independent of, and go on in parallel with, the
FMP computation. It is assumed that the file manager stages the
next job, and unloads the last job, in times which are completely
overlapped with current computation. DBM-EM transfers are also
ignored, since they go on concurrently with current processing (as
long as EM space is available). At a transfer rate of 40 Mw/s,
the 15 million words of a restart point of a typical aero flow
code are loaded in 0.375 seconds, which can be compared with the
600 seconds duration of a typical run. Hence, even when these
transfers cannot be overlapped because of EM allocation conflicts,
they should have little effect on aero flow computations.
Therefore, both system I/O and user I/O were ignored.
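
The restart-load estimate above is a single division, and a short sketch makes the relative cost explicit. The function name is illustrative; the three numbers (15 million words, 40 Mw/s, 600 seconds) are the ones quoted in the text.

```python
def transfer_seconds(words, words_per_second):
    """Time to move `words` at a sustained rate of `words_per_second`."""
    return words / words_per_second


load_time = transfer_seconds(15e6, 40e6)   # restart point at 40 Mw/s
fraction_of_run = load_time / 600.0        # against a typical 600 s run
print(load_time, fraction_of_run)
```

Even with no overlap at all, the load is well under one tenth of one percent of the run, which is why it is neglected.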


Each program was analyzed to find the calling tree of its
subroutines. Major program parameters such as grid size, total
number of time steps, etc. were then established. This data
allowed the determination of the total number of executions of
each subroutine.

An analysis of all data declarations was then performed to
establish the GLOBAL or LOCAL memory placement of all major
variables. This analysis also determined those variables that were
potential type INALL variables. The programs were then scanned to
establish the placement of the DOALL statement construct
throughout the program structure. This information determined the
number of parallel machine cycles for each DOALL and the processor
utilization level number. A handcount was then performed on all
routines to determine the total number of all floating point
operations (f), the number of floating point divide operations and
the number of Extended Memory accesses (m). Processor utilization
was also noted for each code sequence. Next, high usage sections
of typical code were selected for hand compiling into FMP machine
language. Results from detailed simulations of these code
sections were then used to develop an empirical formula used to
interpolate the performance of code sections not simulated. This
formula is a linear function of the number of floating point
operations, the number of floating point divide operations and the
number of extended memory accesses. These three factors are
sufficient to fit the simulation results, after constants are
adjusted to provide agreement with detailed simulation results.

The  following  symbol  definitions  pertain  to  the  equations  below: 

Tg  = Total system throughput rate - Gflops/second

Tp  = Single processor throughput rate - Gflops/second

Σf  = Total floating point operations - Flops

Σd  = Total floating point divide operations over 2% of Σf

Σm  = Total Extended Memory access operations

Σt  = Total program elapsed time (1 processor)

Rf  = Ratio of active to total processors

System throughput is then defined as:

The linear approximation to this function was then determined as:

The value of Kq was then estimated as 1.0 or 1.2 based on
individual estimates of the quantity of nonfloating point commands
in a given code section. Basic system throughput could then be
calculated knowing the individual counts of floating point
operations and Extended Memory accesses via

where the ratio of active to total processors was determined from
the analysis of parallel DOALL statements. Where the formula gave
results in excess of 1.33 Gflops/sec for a particular code
sequence, the value 1.33 Gflops/sec was adopted instead.

The above formula for calculating individual code segment times
assumed that two percent of the floating point operations were
divide operations. The divide instruction consumes 1460
nanoseconds, which is nearly six times longer than the estimated
nominal floating point instruction time. A special count of
divide instructions was therefore included in the analysis. When
this count exceeded the two percent rule, a correction of 1460 ns
times the excess count was added into the above time calculation
formula.
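
The divide correction described above can be sketched as a small helper. Only the 1460 ns divide time and the two percent allowance come from the text; the function name and its arguments are hypothetical, and the base time formula to which the correction is added is not reproduced here.

```python
DIVIDE_NS = 1460.0        # divide instruction time quoted in the text
DIVIDE_ALLOWANCE = 0.02   # share of divides already built into the formula


def divide_correction_ns(flops, divides):
    """Extra time in ns for divides beyond the two percent allowance."""
    excess = max(0.0, divides - DIVIDE_ALLOWANCE * flops)
    return DIVIDE_NS * excess


# A hypothetical code segment with 1000 floating point operations:
print(divide_correction_ns(1000, 20))   # at the allowance, no correction
print(divide_correction_ns(1000, 50))   # 30 excess divides are charged
```

Segments at or below the allowance are left untouched, so the correction only affects divide-heavy sequences.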

Examples of exceptions are TRIB and EIGEN in the implicit (too
many divisions), AVRX in the GISS weather (too much integer
arithmetic and data-dependent processor utilization).

Figure A.1 plots the formula used against the results of
simulations for the implicit code, the explicit code, and the GISS
weather code. It is seen that the formula is validated over a
large assortment of "typical" codes. It is also obvious that the
formula must be taken with a grain of salt, and that each and
every section of code should be scrutinized to see if it
represents some exception for which the formula will not work.

Figure A.1 Throughput Projection Formula vs. Simulation Results


A.3  THROUGHPUT  OF  IMPLICIT  AERO  FLOW  CODE


A.3.1  Summary 

The throughput of the implicit code is 1.01 Gflops/sec for the
grid size of 100 x 50 x 200. This is the estimate resulting at
the end of the analysis. During the course of the analysis, as
various assumptions and corrections were being applied, the
estimate varied from 0.973 Gflops/sec to 1.043 Gflops/sec.

A.3.2 Assumptions


The following were the assumptions and program modifications used
to produce this result. Examples of the resulting code are
included in sections which follow. In addition, Appendix G has a
side-by-side comparison of some of the original codes and the FMP
codes.

All variables indexed on the three grid variables J, K, L were
assumed to be STRUCTURE arrays resident in Extended Memory. In
one case, the accessing pattern was such that the variables could
be assumed to be resident in Processor Memory. In this case, an
instance being executed on a processor was able to access the
STRUCTURE variables without having the time penalty of EM
accessing. Not much improvement is expected when more of the
STRUCTURE arrays are processor resident.

The  grid  size  is  IMAX,  JMAX,  KMAX  = 100,  50,  200 

The  compiler  is  able  to  use  a MAD  or  FADEXL  instruction  when  one 
is  appropriate,  and  to  reorder  arithmetic  expressions.  For 
example,  the  expression  (A  + B*C/2)  would  be  implemented  with  a 
FADEXL  and  a FMAD. 

I/O  operations  are  ignored. 

NMAX  = 100,  arbitrarily. 

All arrays declared as A(720,6,30), where the 720 dimension is
indexed on KL = (L-1)*ND+K and the 30 dimension on J, are assumed
to be changed to A(100,50,200,6), where the subscripts used will be
J, K, L, and the remaining index, respectively.
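
The combined subscript KL = (L-1)*ND+K folds two grid indices into one. A minimal sketch of that mapping and its inverse follows; the function names are illustrative, and the value ND = 24 with 30 planes in L is a hypothetical choice picked only because it fills the 720 dimension exactly.

```python
def kl_index(k, l, nd):
    """Fold 1-based (K, L) into the single 1-based subscript KL."""
    return (l - 1) * nd + k


def kl_to_k_l(kl, nd):
    """Recover 1-based (K, L) from the folded subscript KL."""
    l_minus_1, k_minus_1 = divmod(kl - 1, nd)
    return k_minus_1 + 1, l_minus_1 + 1


ND = 24  # hypothetical value; 24 * 30 = 720 matches the declared dimension
print(kl_index(1, 1, ND), kl_index(ND, 30, ND))
print(kl_to_k_l(720, ND))
```

Separating K and L into their own subscripts, as the rewritten arrays do, removes the multiply hidden in every KL reference and lets each grid direction be named directly in a DOALL.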

A total  of  94  separate  sequences  of  code  were  identified. 

The computation of RESID at the beginning of STEP is assumed to be
a SUMALL over the domain J=1,100; K=1,50; L=1,200. With 1920
cycles in this DOALL, the 9 extra steps at the end for the SUMALL
are insignificant.

All calls on subroutines XXM, YYM, and ZZM were brought up into
line. Further, the resulting code was put down into the DO loop
that normally follows such calls so that XX, YY, and ZZ are
recomputed each time. The result is that the four result values
produced by each single iteration, within the former XXM, YYM, or
ZZM, are used immediately, and can be LOCAL variables. If the
program were left in its present structure, where all elements of
the arrays XX, YY, and ZZ are computed at one time, the arrays
would have to be either INALL, with 100-fold waste of memory
space, or GLOBAL, with 100-way access conflicts in memory when
these one-dimensional arrays are used in two-dimensional DOALLs.
By computing these elements one at a time at the point where they
are used, the memory to store them is saved. The amount of
computation does not change but several copies of in-line code are
needed to replace each such subroutine.

In VISRHS, and in BTRI, essentially identical code is seen
replicated. One copy is executed at one end point (say I=1), the
other copy is executed at the other end point (say I=IMAX), and the
third copy is inside a DO I=2,IMAX-1 loop. In VISRHS these three
cases were subsumed into a single DO I=1,IMAX loop. In BTRI, an
observation on the incoming data shows that the first iteration is
degenerate (a diagonal matrix is being decomposed, which is
nearly a no-op), so the first copy is rewritten, and the latter
two combined into a DO I=2,IMAX loop.

SMOOTH  was  rewritten  into  a single  three-dimensional  DOALL. 

Only those named common areas that are actually used in a program
unit are declared. This improves FMP operation, speeds up
subroutine entry and sometimes releases memory space.

Where feasible, divisions were replaced by multiplication by the
reciprocal, including every division by a literal.

In doubly nested DO loops with simple subscripting (DO N=1,5 and
DO M=1,5), the code is assumed restructured either by the
programmer or by a later optimizing version of the compiler such
that there is no more than one integer multiply per set of
subscripts. For example, one can increment auxiliary index
variables per iteration. Two such loops contain 26 percent of all
the floating point operations in the program.
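
The auxiliary index idea can be illustrated with a 5 by 5 array stored in FORTRAN column-major order. This is only a sketch of the general technique, not the restructured FMP code: the loop order (N outer, M inner), the array shape and the function names are assumptions made for the example.

```python
def offsets_with_multiply(dim=5):
    """Offset of A(N,M), recomputed with a multiply on every reference."""
    return [(m - 1) * dim + (n - 1)
            for n in range(1, dim + 1)
            for m in range(1, dim + 1)]


def offsets_incremental(dim=5):
    """Same offsets, using an auxiliary index bumped once per iteration."""
    out = []
    for n in range(1, dim + 1):
        offset = n - 1            # start of row N in column-major storage
        for m in range(1, dim + 1):
            out.append(offset)
            offset += dim         # step to the next column: an add only
    return out


print(offsets_with_multiply() == offsets_incremental())
```

Both versions visit the same storage locations in the same order; the second replaces the per-reference multiply with a single integer add in the inner loop.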

A.3.3  Analysis  of  Implicit  Aero  Flow  Code

Equation A.6 (Section A.2) is an extrapolation of the simulation
results to the portions of the code that were not simulated.
About 60 percent of the running time of the implicit code is
represented in the two simulations that were done, namely
subroutine BTRI and the portion of subroutine RHS that used to be
subroutine AMATRX in a previous version of the program. This
gratifyingly high percentage of execution actually simulated
arises because BTRI itself represents over 55 percent of the
computation of the implicit code. One statement in BTRI, which is
executed 25,000,000 times during the course of the program,
represents 21 percent of all the floating point operations in the
entire program, and is found in the test case.

Exceptions to Equation A.6 are code sequences in TRIB, EIGEN, and
INITIA with an atypically high proportion of divides. These are
executed so infrequently as to disappear from the total throughput
figure. At the beginning of BC there is a section that could have
been implemented as a series of SUMALLs. In this analysis, the
summations were done serially with 38 percent processor
utilization instead. On the other hand, a SUMALL was used at the
beginning of STEP to compute the variable RESID. This runs with
"typical" speed because of the size of the DOALL, which is across
all three dimensions, or 1,000,000 instances, so that the final
512-way summation takes negligible time compared to the 1920
cycles in the DOALL. The processor utilization for this case is
99.97 percent.
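
The 9 extra steps quoted for the final 512-way summation are what a pairwise (tree) combination of 512 partial sums would need, since 512 is 2 to the 9th power. The sketch below assumes that is how SUMALL combines the per-processor partial sums, which the text does not spell out; the function name is illustrative.

```python
def tree_sum(values):
    """Pairwise reduction; returns (total, number of combining steps)."""
    vals = list(values)
    steps = 0
    while len(vals) > 1:
        # Add neighbours in pairs; an odd element is carried forward.
        paired = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            paired.append(vals[-1])
        vals = paired
        steps += 1
    return vals[0], steps


total, steps = tree_sum(range(512))   # one partial sum per processor
overhead = steps / 1920               # against the 1920 cycles in the DOALL
print(total, steps, overhead)
```

Under that assumption the reduction adds less than half a percent to the DOALL, consistent with the text's description of it as negligible.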

"AMATRX"  was  simulated.  It  is  the  part  of  the  subroutine  STEP  so 
identified in a line of comment. The test case consisted of 3750
floating point operations per processor, achieved by iterating
several times around the code. Hence, the frequency of execution
of loop control was somewhat higher than in the actual case in
STEP, where additional operations are in the same loop. The
observed time is counted in clocks per processor. At 40 ns per
clock, and 512 processors, this computes to 1.330 Gflops/sec for
the entire FMP. Overlap between the several execution units
within the processor was such that on the average there were 1.20
instructions in the course of execution at any one time.
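
The conversion from a simulated clock count to a system rate can be sketched as follows. The 40 ns clock, the 512 processors and the 3750 operations per processor are from the text; the clock count itself is not given there, so the figure of roughly 36,090 clocks below is only what the quoted 1.330 Gflops/sec implies, not a reported measurement.

```python
CLOCK_NS = 40.0     # clock period from the text
PROCESSORS = 512    # processors in the FMP


def gflops(flops_per_processor, clocks):
    """System rate when every processor does the same work in `clocks`."""
    seconds = clocks * CLOCK_NS * 1e-9
    return flops_per_processor * PROCESSORS / seconds / 1e9


def implied_clocks(flops_per_processor, rate_gflops):
    """Clock count per processor that would yield the given system rate."""
    seconds = flops_per_processor * PROCESSORS / (rate_gflops * 1e9)
    return seconds / (CLOCK_NS * 1e-9)


clocks = implied_clocks(3750, 1.330)
print(round(clocks), gflops(3750, round(clocks)))
```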

"BTRI"  was  also  simulated.  The  test  case  was  constructed  by 
taking  the  doubly-nested  DO  loop  identified  by  the  comment 
"COMPUTE B PRIME", and following it with one pass through "INSERT
LUDEC AGAIN", and wrapping an outer loop around both. There were
a total of 650 floating point operations executed in simulation.
For present purposes, it is instructive to separate the 500
operations executed during the doubly nested loop, and the 150
executed in LUDEC. The LUDEC portion of the simulation executed at
1.30 Gflops/sec, while the doubly nested loop executed at 1.170
Gflops/sec, at 512 processors busy.
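
When segments that run at different rates execute one after another, the combined rate is total operations divided by total time, not the plain average of the rates. The sketch below applies that to the two portions of the "BTRI" test case, assuming they run back to back with nothing else in between; the combined figure is derived here and is not quoted in the text.

```python
def combined_rate(segments):
    """Overall rate for (operations, rate) segments executed in sequence."""
    total_ops = sum(ops for ops, _ in segments)
    total_time = sum(ops / rate for ops, rate in segments)
    return total_ops / total_time


BTRI_TEST_CASE = [
    (500, 1.170),   # doubly nested loop, Gflops/sec
    (150, 1.30),    # LUDEC portion, Gflops/sec
]
print(combined_rate(BTRI_TEST_CASE))
```

Because most of the operations sit in the slower loop, the combined rate lands much closer to 1.17 than to 1.30.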

Hence, the assumption of 1.33 Gflops/sec for "ordinary" code
execution speed where all variables are local to the processor is
justified. However, when single statements are found inside doubly
nested DO loops with triple subscripting on most of the arithmetic
primaries, performance is derated to 1.17 Gflops/sec. The two
swatches of code deserving this derating are the loops in BTRI
and a similar loop in VISMAT. The simulation printout associated
with this loop in test case "BTRI" shows that 14.6 percent of the
time the processor was waiting for instruction fetch. These were
primarily integer instructions associated with subscript
computations. A sequence of integer IADDs, for example, can be
executed faster than the instructions can be fetched.

A.3.4 FMP FORTRAN Version

A.3.4.1 One-to-one Mapping from Serial FORTRAN

There is a simple one-for-one translation from FORTRAN furnished
by NASA into FMP FORTRAN as follows. All arrays subscripted with
the grid variables are made STRUCTURE. DO loops (single or
nested) on the grid variables are automatically turned into DOALLs
as long as the data dependence allows it. Temporary variables are
allowed to be LOCAL by default. The implicit code, as supplied by
NASA, is of such regularity that practically all of it can be
transformed into FMP FORTRAN using such simple rules. Because of
this, and in order to save time, most of the FMP FORTRAN version
of the implicit aero code was not even written down, since it was
obvious from the NASA-furnished version by inspection.

SMOOTH  and  BTRI  were  rewritten  to  better  match  the  structure  of 
the  FMP.  Discussion  follows. 

A.3.4.2 SMOOTH

A revised FORTRAN version of subroutine SMOOTH is exhibited in
Figure A.2. All computation is put into a three-dimensional
DOALL. Note that the arrays Q and S (which have total
dimensionality Q(100,50,200,6) and S(100,50,200,5)) are defined as
STRUCTURE variables since they are included in both an INALL
statement and in a USING clause over the domain. These arrays would
exist in Extended Memory. The other variables defined over the
structure (SS, CT, and the temporaries T1, T2, T3, and T4) are
allocated space within each processor. Note that only SS and CT
must be unique to an instance over the sections of the DOALL. The
temporaries could share storage with other instances.
Computations on SS and CT, having 10^6 elements uniformly
distributed over the processors, will take up 1862 (cycles) * 6 =
11178 words of processor memory during the execution of subroutine
SMOOTH.

The other large user of processor memory space is BTRID, a LOCAL
COMMON area which must be declared inside the DOALLs of STEP so it
can be common to the calls on BTRI. Here is an example of the use
of dynamic memory allocation. Upon leaving the last DOALL in
STEP, this LOCAL COMMON is deallocated, leaving space for SS and
CT to be allocated during SMOOTH. See the following section on
the rewritten BTRI for further discussion.

Temporary variables TEMP, T1, T2, T3, and T4 were used to hold
copies of STRUCTURE array elements so that they could be used
through several operations with only one fetch from EM.
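
The saving from such temporaries can be sketched as follows. This is
an illustration only: the class `CountingMemory` and both functions are
hypothetical stand-ins, and the keys merely imitate the sixth
component of Q being reused across a loop on N = 1,...,5.

```python
class CountingMemory:
    """Stand-in for Extended Memory that counts element fetches."""

    def __init__(self, data):
        self.data = data
        self.fetches = 0

    def fetch(self, key):
        self.fetches += 1
        return self.data[key]


def scale_without_temp(em, n_vars=5):
    # The scale factor is fetched again on every pass of the N loop.
    return [em.fetch(("Q", n)) * em.fetch(("Q", 6))
            for n in range(1, n_vars + 1)]


def scale_with_temp(em, n_vars=5):
    # One fetch into a local temporary, reused across the N loop.
    t1 = em.fetch(("Q", 6))
    return [em.fetch(("Q", n)) * t1 for n in range(1, n_vars + 1)]


values = {("Q", n): float(n) for n in range(1, 7)}
em_a, em_b = CountingMemory(values), CountingMemory(values)
same = scale_without_temp(em_a) == scale_with_temp(em_b)
print(same, em_a.fetches, em_b.fetches)
```

Both versions compute the same values; only the number of fetches
differs.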

The  statement  NEXTDO,  used  in  this  code  but  not  explained  in 
Chapter  4,  is  a convenience.  The  NEXTDO  statement  is  shorthand 
for  an  ENDDO  statement  followed  immediately  by  another  DOALL  on 
the  same  domain. 
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iOO  SUBROUTINE  SMOOTH 

20  0 COHMON/BftSE/NMfiX,  UMAX,  KMflX,L.HflX,OT,(3AMMR,GRMl,  FSMftCH, 

3^0  i OXi, DVX  j 02X,rU<5),F0(5)| HO, ftUP, 60, OMEGA, HDX , HDY , H02 , RM , 

400  2 CHBR, PI , ITR , NR, INTI, IHT£, I NT3 

4i0  DOMAIN  /MDOEL/<  J»i,X00;  KsX,S0J  U«X,200 

420  REGION  /THREeO<<U«£,UMAX-X) , ( K*=£ , KMAX-X ) , (U , UM AX-i  ) ) / 

•>  30  w tt  /MDOELC  J,  K, 

440  INALU  /MODEL/  , S<5  > , SS,  CT  (5)  , 7X,  T2,  T3,  T4 

71^0  C HTH  order  smoothing,  20  ORDER  AT  THE  BOUNDARIES 
XOOO  DOALU  /THRrED< J, K,L^/  ; USING  A,  S,  SMU 

X200  TEMP  s X, /A( J, K, L, 6> 

XB'^'O  DO  X H«i,5 

X400  CT(N>s  G< J, K, L, H J, K, L, 

X500  X CONTINUE 

X600  IF  (J.EA.i:  ,DR,  J.EG.JMAX-i>  THEN 

1700  TX  s <5i<  JrX, 

X800  T2  c Q<J-X,K,L,8; 

X900  DO  2 Nsx,5 

2000  SS  = S(J,K,L,U^  » 0 , iixSMUR  (6(  Jf  X,  K,  U,  N ^ RTX  ) - £,)«iCT<W)  *r 

2X00  X Q(J-X, K, L,N^RT£>ATEMP 

E£00  2 CONTINUE 

2300  ELSE 

£400  DO  3 N«X,5 

£500  TX«A( J+2, K, L, 6> 

£600  7£sQ< J-£, K, L, G> 

£700  T3aA( JtX,H,L, 6/ 

£300  T4aA<;U**X,  K,  L,  6> 

2^00  SS  s S(J,K,L,N;  -r  SMU«<  A < Jt£,  K,  L,  N^XTI  + A ( J~£,  K,  L , N ; xTP 

X 4.x(A(  JrX,K,L,N^xT3  + A(  J-X,  K,  L,  N^xTlf  ) - 6 , xCT  ( N ^ > AT  U*  MP 

3X00  3 CONTINUE 

3200  ENOIF 

5300  NEXTDO 

3400  IF  <K,EA,£  ,0R,  K,EA,kMAX-X>  THEN 

3500  TlnQ< J, KtX, L, 6> 

3600  T£sq( j, K-X , L, 6> 

3700  DO  4 Nsx,5 

3300  SS  = SS  + U, 5xSMUA(A( J, K+X, L, N/ATX  + A ( J , K-X , L , N ^xT£  - 

3900  X £.xCT<N)>ATEMP 

4000  4 CONTINUE 


Figure A.2 FMP FORTRAN Version of SMOOTH
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4500 

T4»a< K-1, u, 6> 

4600 

DD  5 N»l,5 

4700 

j5SaSS  + SMU)«cC€K  J,  K + E,  L,  N;>icTl  + (S  ( J , K-E  , L , N ; wXE  + 

4600 

X 

4,  }«((«<  J,  K + X,  L,  N;xT3  + Q(  J,  K-i,  L,  N)AT4)  - 6 , ytCT  C N ) ) *TEMP 

4900 

5 

CONTINUE 

5000 

ENDIF 

5100 

NEXTDQ 

5200 

IF  (L.EOt.S  .OR.  L.  EG,  UMftX-X)  THEN 

5300 

T1sG<J,K,LT1,6> 

5400 

T2=G(U,K,L-i,6; 

5500 

DO  6 N»X,5 

5600 

S(J,K,U,N>  3 SS  + 0.5ASMU3«t<Q<J,K,LTl,N)*TX  + 

5700 

G(J,K,U-X,N>)»cT£  - S , ACT  < N > ) ATEMP 

5800 

6 

CONTINUE 

5900 

ELSE 

6000 

Tl  s Q<U,K,Lt£,6> 

6100 

T£  a G(J,K,L-£,6) 

6200 

T3  a G<J,K,LtX,6> 

6300 

T4  a G(U,K,L-X,6> 

6400 

DO  7 NaX,5 

6500 

S(J,K,L,N>  a S:5  + SMUa  ( Q ( J , K , LtE  , N ; ATI  + G ( J , K , L-*E  , N > ATE  + 

6600 

X 

4.  AGtUfKjLl'lfN^ATS  + 4.  AG<  J,  K,  L-X,  W>AT4 

6700 

e 

- 6. ACT(N> >ATEMP 

6800 

7 

CONTINUE 

6900 

ENDIF 

7000 

ENDDO  /THREED/  J GIVINS  S 

7200 

RETURN 

7300 

END 

Figure A.2 FMP FORTRAN Version of SMOOTH (Cont'd)


The resulting rewrite of SMOOTH reduces the number of flops from
225 x 10^8 to 201 x 10^8, and the number of EM accesses from 195 x
10^8 to 72 x 10^8, as compared to a mechanical translation of DO
loops to DOALLs. Thus, the time is improved more than the
throughput.
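
The relative size of the two reductions can be checked with a few
lines of arithmetic. This sketch uses only the four counts quoted
above; the power of ten is common to all of them and cancels in the
ratios.

```python
flops_before, flops_after = 225.0, 201.0   # common power of ten omitted
em_before, em_after = 195.0, 72.0

flop_cut = (flops_before - flops_after) / flops_before
em_cut = (em_before - em_after) / em_before
print(f"flops reduced by {flop_cut:.1%}, EM accesses by {em_cut:.1%}")
```

Execution time falls with both counts, while throughput is flops
divided by time, so the smaller flop count in the numerator offsets
part of the gain. That is why the time improves more than the
throughput.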

A.3.4.3 BTRI

The subroutine BTRI was also modified. Observe that when BTRI is
entered, array B is a diagonal matrix with zeros off the diagonal.
The first piece of code, which is a copy of LUDEC, therefore
executes with most of its input variables equal to zero. LUDEC is
a modified Cholesky decomposition. When faced with a diagonal
matrix, it produces a copy of that matrix for the lower triangular
matrix, and produces the identity matrix for the upper triangular
matrix. In BTRI the variables L11, L22, L33, L44, and L55 are the
reciprocals of the diagonal terms, in order to save repeated
unnecessary divisions later on. The diagonal terms of the upper
triangular matrix are unconditionally equal to 1.0 and hence are not
computed.
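
The behaviour on a diagonal input can be checked with a small sketch.
It assumes a Crout-style factorization (lower factor carrying the
diagonal, upper factor with a unit diagonal), which matches the
description above but is not confirmed to be the exact form of LUDEC;
the function name `crout_lu` is hypothetical.

```python
def crout_lu(a):
    """Factor a = lower * upper, with ones on the diagonal of upper."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    upper = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):
        # Column j of the lower factor.
        for i in range(j, n):
            lower[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j]
                                        for k in range(j))
        # Row j of the upper factor, to the right of the diagonal.
        for i in range(j + 1, n):
            upper[j][i] = (a[j][i] - sum(lower[j][k] * upper[k][i]
                                         for k in range(j))) / lower[j][j]
    return lower, upper


b = [[2.0, 0.0, 0.0],
     [0.0, 4.0, 0.0],
     [0.0, 0.0, 5.0]]
lower, upper = crout_lu(b)

# Reciprocals of the diagonal, in the role played by L11, L22, ... in BTRI.
reciprocals = [1.0 / lower[i][i] for i in range(len(b))]
print(lower == b, upper, reciprocals)
```

For a diagonal input every off-diagonal sum is zero, so the general
loops collapse to copying the diagonal and taking its reciprocals,
which is the simplification applied to the first copy of LUDEC.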

The first copy of the former LUDEC, as shown in NASA's BTRI, can
be simplified to the version shown in the attached listing, Figure
A.3. The last iteration of the former LUDEC, at index equal to
IUA, differs from the central iterations only by the omission of
the computation of C PRIME. To simplify the source code, this
copy was pulled into the main iteration in BTRI.

Common area BTRID would be declared in STEP:

LOCAL COMMON/BTRID/ A(LMAX,5,5), B(LMAX,5,5), C(LMAX,5,5),
D(LMAX,5,5), F(LMAX,5)

in that call on BTRI in which the limiting index is LMAX, using
the one-for-one translation of the original.

With  LMAX=200,  this  means  that  common  BTRID  is  21,000  words  long. 
When  the  extent  is  JMAX,  BTRID  will  take  10,500  words  and  when  the 
extent  is  KMAX,  BTRID  will  be  allocated  5,250  words.  Note  that  in 
STEP,  where  this  COMMON  is  initially  specified,  it  is  not  declared 
in  a USING  or  GIVING  statement.  For  this  reason,  it  is  a LOCAL 
area  allocated  within  each  processor.  The  copy  of  the  subroutine 
BTRI  resident  in  each  processor  accesses  the  common  area  in  that 
processor.  By  the  time  that  BTRI  is  executing,  the  current 
instance  of  STEP  would  have  initialized  the  appropriate  part  of 
that  common  block. 
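
The three sizes quoted for BTRID follow directly from the array
shapes. A minimal sketch of the arithmetic, assuming JMAX = 100 and
KMAX = 50 as used elsewhere in this appendix (the function name is
hypothetical):

```python
def btrid_words(extent):
    """Words in BTRID: four (extent,5,5) arrays plus one (extent,5) array."""
    return 4 * extent * 5 * 5 + extent * 5


sizes = {name: btrid_words(n)
         for name, n in (("LMAX", 200), ("JMAX", 100), ("KMAX", 50))}
print(sizes)
```

Each index value costs 105 words, so the allocation scales linearly
with the extent of the direction being swept.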

Within the separately compiled subroutine BTRI, the declaration of
BTRID takes the form:

COMMON  /BTRID/  A(IUA,5,5),  B(IUA,5,5),  C(IUA,5,5), 

1 D(IUA,5,5),  F(IUA,5) 


SUBROUTINE BTRI(IUA)


ftSSUWe  STARTING  INDEX  s 1 


COMMON  /BTRID/  fl<IUA,5,5),  B<IUA,5,5>,  C<IUA,5,5>, 
1 D(IUA,5,5>,  r<rUA,5> 

DIMENSION  H(5,5> 

IMPLICIT  REAL<Lj 


INSERT  LUDEC  ^SIMPLIFIED  FDR  DIA6DNAL  INPUT  ARRAY  FOR  Isl 

Lii  c 1,/B(;i»i,ip 
L22  s X»/BCl|^2y2) 

L33  x,/aa,3,3> 

L44  « l./e<l,4,4> 
l55  s 1,/B(lt5,5> 

COMPUTE  LITTLE  R«S  OMITTED,  THESE  TEMPORARIES  NOT  NEEDED 
THIS  PASS,  COMPUTE  BIG  R«S 

F<1,5>  Si  L55 
F<1,4)  s L44 
F<1,3)  s L33 
F<1,2>  s L22 
F<1,1)  s Lll 

COMPUTE  C PRIME  FDR  FIRST  RDM 




DO  12  M s 1,5 

C HAS  BEEN  ELIMINATED  AS  A SIMPLE 
RESUBSCRIPTING  OF  THE  0 ARRAY 


B<1,5,M>  != 
B<i,4,M>  s: 
B(1,3,M> 
S<1,2,M>  a 
a 

CONTINUE 


L55  X C<I,5,M> 
L44  X C(I,4,M^ 
l33  X C(I,3,M; 
L22  X C<I,2,M^ 
Lll  X C<t,l,M> 


Figure A.3 FMP FORTRAN Version of BTRI
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^90 

C 

50  0 

C 

HERE  NOW  STARTS  THE  MAIM  LOOP  OF  B7R1 

510 

C 

5£0 

DO  13  I s:  £,  lUft 

530 

C 

540 

c 

COMPUTE  B PRIME  A BI6R 

550 

c 

560 

DO  14  Nsl,5 

570 

14 

F<1,N;  u F<I,N)  - Ai.  I,W,i^  A F<  I~l,x)  - A<I,N,2>  A 

560 

1 

“ Av,l5Wj3i>  A FC*-**lj«jJ*  “ ft\,IjHj4/  A 

590 

3 

F(I-x,4^  - A F<I“i,5> 

600 

c 

610 

c 

COMPUTE  B PRIME 

6E0 

c 

630 

DO  11  H s x,5 

640 

DO  11  M « 1,5 

650 

il 

H*:n,M>  ::  B<I,N,M/  - A < I , N , x > A E<I-l,x,M^  - 

660 

JU. 

A<I,W,2>  A bCl-XjEjM;  - A(l,tJ,C>  A 

670 

2 

B ( I ••  1 , , M > ~ A V I , N , 4 J A E (I  -X  , 4 , M / - 

660 

3 

A<I,H,5>  A B<I~1,5,M^ 

690 

c 

70  0 

c 

LUDEIC  6GRIN 

710 

c 

7E0 

730 

7U0 

HERE  SHALL  BE  INSERTED  ft  CPPY  OF  THE  FORMER  LUDEC, 

750 

EXACTLY  ftS  SHOWN  IN  THE  IMPLICIT  CODE  COHPILftTION  BY  SCHAEFFER 

760 

770 

760 

c 

790 

c 

COMPUTE  LITTLE  K '■  S 

COO 

c 

610 

01 

s Lxi  A r< I , i; 

CEO 

DE 

s LEE  A CF<I,E)  - LEx  a Di) 

330 

03 

s l33  a (F<I,3>  - L31  A Di  - l3E  a DE> 

340 

04 

=s  L44  a ^F(I,4>  - L4x  a Dx  - L4E  A D2  - L43  A D3> 

850 

05 

= l55a<F(I,5>  - l51ADx  - l5EadE  - l53a03  - l54aD4.j 

Figure A.3 FMP FORTRAN Version of BTRI (Cont'd)




C 

C 

c 


15 

13 

C 

c 

c 

c 

c 


19 

20 


CDMPUTC  BIG  Ri$ 


ra,5>  « d5 
F<I,4>  k d4  -«U45}«c05 
r(I,3)  U d3  - U34xF<I,4>  - 
F(X,2>  a 02  - u23xF(I,3>  - vu.-T,.r 
F<I,1)  a Dl  U12aF<I,2>  - U13XF 
IF  (I  ,L7.  XVft~)  THEN 
on  15  M a 1,5 

01  a Lii)«cC<  I , I,  M; 

02  a L22A<C< I , 2, M)  - L2iX0i> 

03  a l33x<C<I,3,M>  **  l31w01 

04  a U44x<C(l,4,M^  - l4ixD1 
05  a l55x<C<;  Ij  5,H;  - L'iiXDl 

a 05 

bO,4,M)  a d4  - U45XD5 


U35xD5 

u24xF(i,4;  - U2S«d5 

U,3>  - Ui4xF(I,4>  - U15X05 


U325<c02> 

U42X02  - L43x03) 
l52xd2  - ;.53x03  - u54xd4) 


B<I,4,M)  a 04  - U45XD5 

B<I,3,M>  a 03  « U34xC<I,4,M)  - 035x05 

b<:,2,M>  a o2  - U23XB(I,3,M)  - u24xB<X,4,M)  - U25xo5 

B<I,i,M;  a 01  - U12XB<I,2,M>  - Ul3XBaj3^^>mTT1«Vi^’*B<I,4,M; 

uiSxoS  OF 


CONTINUE 
ENDIF 
"mjTT  wiir 


THIS  IS  THE  END  OF  THE  MAIN  I LOOP,  INCLUOINS  lalUfl 

NOTE  THE  NEGATIVE  CODE  INCREMENTS  IN  THE  NEXT  SECTION 

DO  20  I a lUA-i,  1,  *♦! 

DO  19  Nal,5 

F<I,N>  a F<I,N>  - F(I+i,lMBU,H,i>  ~ F < I + 1 , 2 > X B < I , N , 2 > 
CONTINUE 
RETURN 
END 


Figure A.3 FMP FORTRAN Version of BTRI (Cont'd)


(Note: If the programmer is comfortable only with literal extents
on arrays, all these declarations could be replaced by COMMON/
BTRID/ A(200,5,5), B(200,5,5), C(200,5,5), D(200,5,5), F(200,5)
which, in the present instance, merely allocates some memory that
was going to remain unused in any event.)

For handling a larger mesh, note that only the diagonal elements
of A and C serve any real purpose. All off-diagonal elements are
simply copies of elements of array D with offset subscripts.
Thus, with substantial complication, due to testing to see which
array should be fetched at any given time, BTRID could be
shoehorned into 13,000 words for the 200-long dimension, or into
19,500 words for an LMAX of 300. The present analysis ignores
this possibility.
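
Both compressed sizes correspond to 65 words per index value. One
layout that gives this figure, offered only as a reading of the
paragraph above and not as the authors' stated design, keeps the five
diagonal terms of A and of C and the full B, D and F entries:

```python
def btrid_words_compressed(extent):
    # Assumed layout per index value: 5 diagonal terms each for A and C,
    # full 5x5 blocks for B and D, and the 5-element F vector.
    per_index = 5 + 5 + 25 + 25 + 5
    return per_index * extent


print(btrid_words_compressed(200), btrid_words_compressed(300))
```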

A.3.5 Analysis

Figure A.4 shows the sections of code into which the implicit
program was dissected for the sake of analysis. Subsequent to
this analysis, it was determined that all calls on subroutines
XXM, YYM, and ZZM should be brought up in line, primarily to
avoid unnecessary saving of temporary variables, as described
above under "assumptions".

Table A.1 shows some of the data abstracted from these sections.
In Table A.1 the subroutines XXM, YYM, and ZZM have been combined
into their callers.

Table A.2 shows this data recombined into an estimate of overall
throughput. Rather than clutter this appendix with all
intermediate computations, Table A.2 has the results accumulated by
"group", where a group is a group of swatches all with the same
multiplier, and the same, or approximately the same, processor
utilization percentage. Three subtotals are exhibited. The first
subtotal includes all the easy parts of the code, iterations or
instances which are at least triply nested on the four main
indices, J, K, L, and N (the time step). The second subtotal
includes all those swatches of code that are essentially
negligible. Most of the time here represents serial computation where
all the processors are computing CONTROL variables in parallel but
only getting credit for the one operation that it would take in a
serial machine to compute this value. The third subtotal gathers
together some operations where one-dimensional DOALLs, with fewer
instances than there are processors, result in low processor
utilization. Even so, operations with low processor utilization
are essentially negligible at the problem size considered here.
[Figure A.4: branching-tree diagram of the implicit code; the layout
is not recoverable from the scan. Program units legible in the
diagram: AIR3D, EIGEN, INITIA, GRID, METOUT, JACOB, SPIN, STEP,
PLANE, RHS, BC, VISRHS, SMOOTH, VISMAT, MUTUR, BTRI, LUDEC, TRIB,
STEPSUM, XXM, YYM, ZZM, and the input/output nodes GRIDIO, AIR3DIO,
OUTPUTIO and PLANEIO.]

Key:  CAPS  = Program units.
      -all  = DOALL over indicated variables.
      -loop = DO loop.
      ( )   = null node except for entry and return.


Figure A.4 Breakdown of Implicit Code into Segments of Code
and Nodes for Analysis


A.3.5.1 Description of Table


In the "multiplier" column, N, J, K and L are abbreviations for
NMAX, JMAX, KMAX and LMAX respectively. Below that is the
multiplier ("E" stands for "times 10 to the") which results when these
extents are replaced by 100, 100, 50, and 200 respectively.
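
The multiplier notation can be evaluated mechanically. The sketch
below assumes that the letters N, J, K and L stand for the four
extents with the values 100, 100, 50 and 200, and that a leading
number is a plain factor; the helper name `multiplier` is
hypothetical.

```python
EXTENTS = {"N": 100, "J": 100, "K": 50, "L": 200}


def multiplier(code):
    """Evaluate a multiplier such as '75NJKL' or 'NJK'."""
    digits = "".join(ch for ch in code if ch.isdigit())
    value = int(digits) if digits else 1
    for ch in code:
        if ch in EXTENTS:
            value *= EXTENTS[ch]
    return value


for code in ("75NJKL", "NJKL", "NJK", "NJL", "NKL", "JKL", "JK", "JL", "KL"):
    print(f"{code:>7} = {multiplier(code):.2E}")
```

For example, 75NJKL evaluates to 75 x 10^8, which the table writes as
75E8.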


"Ident"  is  the  identifier  from  Figure  A. 4.  Flops  and  EM  accesses 
are the result of a hand count of operations.

"Special Case" is the column for notes. The only special cases
noted are excess divisions, the occurrence of the SUMALL global
function, and the following two notes.

Note 1: Many of the variables accessed here involve triply or
quadruply subscripted array elements. The progression of
subscripts is extremely regular, say indexed on loop variables
and by literals. It is assumed that the compiler or the
programmer has reduced these subscript computations to not
more than one integer multiply per accessed element. There
are several ways to accomplish this simplification.

Note 2: In reevaluating BTRI for this analysis, a substantially
higher portion of the floating point operations were
identified as PMAD than in the hand compiling that led to the
simulator input. A small adjustment was made on account of
this observation.

The notation "(x3)" or "(x2)" is used to signify that there are
three or two nodes or sections in the branching tree (Figure A.4)
with identical instruction counts, and identical number of times
of execution. There seemed no need to repeat identical entries in
the table.
TABLE A.1


Characterization  of  Implicit  Code  Sections 


Multiplier 

Ident. 

75NJKL 

BTRIloop 

75E8 

25NJKL 

VISMATloop 

25E8 

NJKL 

STEPJKloop 

1E8 

STEPKLloop 

Flops/  EM  access/  Special 

Section  Section  Case 


STEPJLloop  117 
SPINJKL  6 
BCJKL  10 
RHSJKLoop  64 
RHSJLloop  64 
RHSKLloop  64 
VISRHSJKloop  210 
MUTURloop  533 
LUDEC  376 
VISMATloop 


533 

376(x3) 

224 


29 

25 

25 

4 

6 

17 
22 
22 
19 
99 

0(x3) 

18 


Note  1 


Note  1 


Note  2 


98.4% 
96. 


96.0% 


98.4% 


96. 

.0% 

98. 

.4% 

98. 

► 4% 

97. 

.4% 

98. 

.4% 

TABLE A.1 continued


Characterization of Implicit Code Sections


Multiplier 

Ident • 

Plops/ 

Section 

EM  acce 
Sectior 

5NJKL 

SMOOTHJKL 

190 

72 

5E8 

NJKL 

STEPSUM 

10 

5 

1E8 

NJK 

BCJK 

667 

80 

5E5 

MUTURJK 

170 

2 

1 

: VISMAT 

5 

0 

BTRI 

10 

0 

NJL 

BCJL 

12 

24 

2E6 

BTRI 

10 

0 

NKL 

BCKL 

33 

16 

1E6 

BTRI 

10 

0 

JKL 

1E6 

EIGENloop 

228 

56 

INITIAJKL 

6 

6 

i 

GRIDJKL 

3 

6 

) 38 

24 

11 

5 

JK 

1 

INITIAJK 

1 

3 

5E3 

JACOBJK 

0 

2 

JL 

■ 

JACOBJL 

0 j 

2 

2B4 

KL 

JACOBKL 

2 

3 

1E4 

PLANEKL 

121 

10 

- 
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Proc 

Util 


0.60 

99.97% 

SUM ALL  on 
10^  inst’s 

0.50 

99.97% 

2 DIV 

U04 

1.28 

1.28 

1.05 

96.0% 

2 DIV 

0.15 

1*05 

98.4% 

2 DIV 

0.49 

1.05 

96.0% 

1.06 
0.29 
0.154 
0.48  I 
0.52  ' 

97.4% 

IDIV 
No  flops 

0.088 

0.000 

96.0% 

No  flops 

0.000 

98.4% 

0.197 

96.0% 

1*18 


TABLE A.1 continued

Characterization  of  Implicit  Code  Sections 


Multiplier  Ident.  Plops/  EM  access/  Special 

Section  Section  Case 


3 


141 

3 


81 

3 

99 

5 

25 


1 

9 

139 


10(x2) 


5 DIV 
11  DIV 


0.26 

0.105 


0.133 

0.053 

9.6% 

0.0026 

0.0026 

0.0026 

0.0026 

0.0026 

0.19% 

0.0003 

0.0012 

0.0021 

0.19% 

0.21 

16.0% 

0.126 

38.4% 

Group     Proc.    Flops/    Multi-   Total     Time       Throughput
          Util.    Section   plier    Flops     (sec.)

[NJKL]    97.4%    3593      1E8      3593E8    343.2
[NJKL]    99.9%     190      1E8       190E8     31.6
[NJK]     96.0%     852      5E5       4.3E8       .398
[NJL]     98.4%      22      2E6       .44E8       .083
[NKL]     96.0%      43      1E6       .43E8       .048
[JKL]     97.4%     314      1E6      3.14E8       .50
Subtotal                              3792E8    375.8      1.010

JK        96.0%       1      5E3       .5E4       .000078
JL        98.4%       0      2E4        0E4       .000117
KL        96.0%     123      1E4      123E4       .000078
NJ        19.2%       7      1E4        7E4       .000269
NK         9.6%       3      5E3      1.5E4       .000024
N          0.19%    210      1E2      2.1E4       .00794
1          0.19%    149      1        149E0       .000057
Subtotal                              134E4       .00856   0.157

NJK*      16.0%      20      5E5       10E6       .046
NKL*      38.4%      43      1E6       43E6       .341
Subtotal                               50E6       .387     0.129

TOTAL                                 3792E8    376.2      1.009

[Group labels in brackets and the three middle column headings are
not legible in the scan; they are inferred from the multipliers and
from Table A.1.]


A.4 THROUGHPUT OF EXPLICIT AERO FLOW CODE
A.4.1 Summary
A.4.1.1 Results

A throughput rate of 0.89 gigaflops/second at an average system
processor utilization of 97.7 percent is estimated for the Hung/
MacCormack explicit aero flow code. This estimate is based on an
assumed grid size of 100 x 100 x 100 elements and 100 time steps.
A total of 4.73 x 10^11 floating point arithmetic operations are
executed in 100 time steps, in 532 seconds. An extended memory
data base of approximately nine million words is also required.

A.4.1.2 Observations

The following general observations were made. Some of these show
up as conclusions in Chapter 3.


• A direct conversion of this algorithm into extended FMP
FORTRAN was accomplished, with considerable ease. All first
and third level subroutines (19 of 30) require basically no
change.

• The ease and efficiency of translation to FMP machine code
was also excellent. A major compiler requirement is
minimization of address indexing operations through
recognition of common subexpressions, which are abundant.

• The given algorithm includes a considerable number of
simple moves from one Extended Memory address to another.
This is visible in routines BCY, PRSETY, BCZ, PRSETZ and
OUTER.

A.4.2 Assumptions

The basic formula used for calculating the total time per module
was transformed to:

Time = K1*#Flops + K2*#EM + K3*#Divs

K1 = 295 nanoseconds per floating point operation (flop)

K2 = 1500 nanoseconds per EM access (#EM). This value
includes time for address calculation.

K3 = 1460 per divide operation (DIV) in excess of 2 percent
of the total flop count.
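
A minimal sketch of this timing formula follows. The unit of K3 is
not stated and is assumed here to be nanoseconds like K1 and K2; the
function name and the sample counts are hypothetical and chosen only
to exercise the 2 percent rule.

```python
K1_NS = 295.0    # per floating point operation
K2_NS = 1500.0   # per Extended Memory access, including address calculation
K3_NS = 1460.0   # per excess divide; the unit (ns) is an assumption


def module_time_ns(flops, em_accesses, divides):
    """Time = K1*#Flops + K2*#EM + K3*#Divs.

    #Divs counts only divides in excess of 2 percent of the flop count.
    """
    excess_divides = max(0.0, divides - 0.02 * flops)
    return K1_NS * flops + K2_NS * em_accesses + K3_NS * excess_divides


no_divides = module_time_ns(flops=100, em_accesses=10, divides=0)
with_divides = module_time_ns(flops=100, em_accesses=10, divides=5)
print(no_divides, with_divides)
```

With 100 flops, the first two divides fall inside the 2 percent
allowance, so only three of the five divides in the second call add
to the time.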

This approach is verified for the explicit code through detailed
simulation of selected typical code segments. Subroutines LX and
FX were selected for this purpose. This data is included in
Figure A.1.

• The algorithm is a tree structured set of thirty subroutines
on three levels. Figure A.4 depicts this structure.

• All level two routines are modified to employ the FMP DOALL
statement in place of the current dual nested DO
statements. The level one main program initializes GLOBAL
variables only. All level three routines are local
subroutines (with copies resident in each processor) to be
executed in parallel as they stand.

• All GLOBAL values and simple constants are stored locally
in all processors.

• The grid size chosen for analysis was 100 x 100 x 100.

A.4.3 Method of Analysis

The initial phase of investigation was a review of available
background material. The Navier-Stokes equations are the
essential mathematical model of the dynamics of a compressible
fluid flow. Reference [A.1] provides the description of an
explicit discrete mathematical algorithm for solving these
equations. NASA supplied this methodology, and the FORTRAN
listing of the resulting program. Figure A.5 shows this program's
structure. This information was then synthesized into Figure A.6,
a list of subroutine groups and the identification of the
program's major outer loop. Further detailed analysis of
individual code segments determined the number of static calls on each
subroutine.

The next analysis was the identification of major data classes.
The following standard FMP classes were identified:

(a)  Nine three dimensional shared arrays from the data
base. These arrays are in common and are STRUCTURE
variables in Extended Memory. This data is accessed via
three dimensional subscripts representing mesh points.

(b)  System wide scalar variables (GLOBAL variables). These
variables are replicated in all processor local memories.

(c)  Local common in processor memory. No communication of
this data between processors is required.
[Figure A.5: calling-tree diagram; the layout is not recoverable
from the scan. Legible labels: MAIN, "Main loop starts", "End of
main loop", READIO, WALL, BCZ.]

Figure A.5 Calling Tree of Explicit Aero Flow Code and Segments
for Analysis


[Figure A.6: calling-tree summary; the layout is not recoverable
from the scan. Legible entries, in reading order: MAIN; Start of
main loop; End of main loop; Initialization routines (run once);
LX, LY, LZ, LYC, LZC, LYI, LZI; SBCINT (several calls); TIMSTP;
TURBDA; PRTFLO; Termination output (run once).]

Figure A.6 Summary of Calling Tree for Explicit Code


(d)  STRUCTURE data wherein an array of elements, one per
mesh point, may be kept in individual processors.
Subroutine SBCINT contains a three dimensional array
"SBC" of this type.

(e)  Strictly local temporary data for a single subroutine or
DOALL block.

(f)  Nameless temporary working store typically required in
expression evaluations. These are typically assigned to
processor registers by the compiler.

Except for type (d), examples are visible in the FMP FORTRAN
version of subroutines LX and FX (Figures A.7 and A.8).

The next step in the analysis was a survey of all subroutines to
identify the DOALL statements. No such statements are required in
the main procedure or in any level three subroutine, which are all
local to the instances of the DOALLs. All level two subroutines
contain dual nested DO loops, which are directly converted to a
DOALL with 10,000 instances. The LX subroutine provides a typical
example (see Figure A.7). Note in the listing of LX that the
DOALL begins at line 102000 and ends at line 107600. Thus, almost
all of LX consists of 10,000 instances of this code (and the call
to FX at line 104500 in each instance). Twenty cycles are
therefore required of each level two and three subroutine to
execute the 10,000 instances, giving a processor utilization of
97.7 percent.
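
The cycle count and utilization can be reproduced with a short
sketch. The processor count of 512 is an assumption (it is not stated
in this passage, but it reproduces both the twenty cycles and the
97.7 percent); the function name is hypothetical.

```python
import math


def doall_utilization(instances, processors=512):
    """Return (cycles, utilization) for a DOALL spread over the processors."""
    cycles = math.ceil(instances / processors)
    return cycles, instances / (cycles * processors)


cycles, util = doall_utilization(100 * 100)
print(cycles, f"{util:.1%}")
```

The last cycle is only partly filled, which is where the missing 2.3
percent goes.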

The initialization routines MESH and WALL, being executed only
once, were ignored. The output routine PRTFLN was also ignored.
The next phase consisted of counting floating point arithmetic
operations, floating point divide operations and Extended Memory
accesses in the subroutines.

This count includes the effects of DO loops, fine or coarse grid
partial subscript range values and the program's branching
structure. This information is given in the various columns of
Tables A.3 and A.4. The product of these counts then produced a
total count of operations per subroutine. The application of the
formula, Time = K1*#Flops + K2*#EM + K3*#Divs, then gave a total
execution time per program module. The total number of flops was
also given by the product of the number of flops per module times
the number of active modules. The system throughput rate Tp was
then computed as:

$$ T_p = \frac{\sum \text{Flops}}{\sum \text{Time}} $$

which is an average flops/second rate value.
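
As a check on this definition, the sketch below recomputes two of the
quoted rates: the LX entry of Table A.4 (3.30E07 microseconds,
1.98E10 flops) and the overall figures from the Results section
(4.73 x 10^11 flops in 532 seconds). The function name is
hypothetical.

```python
def throughput_gflops(total_flops, total_time_us):
    """Average rate in gigaflops/second from flops and microseconds."""
    seconds = total_time_us * 1e-6
    return total_flops / seconds / 1e9


lx_rate = throughput_gflops(1.98e10, 3.30e07)
overall_rate = throughput_gflops(4.73e11, 532.0e6)
print(round(lx_rate, 2), round(overall_rate, 2))
```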
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iOxOOO 

iOliOO 
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x0i300 
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iOiSOO 

ioieoo 

J.0J.700 

i017i0 

iOiSOO 

iocooo 

i02JL00 

102300 

102300 

102400 

102500 

102600 

102700 

102300 

102900 

103000 

103100 

103200 

103300 

103400 

103500 

103600 

103700 

103300 

103310 

103820 


SUBROUTINE  LX 
LX  UPERRTOR 

CDMMON/rtli/  RHO<10.0 ,10  0,100  ) ,RHau<lOO  , 100,100  ),  RHDM<100 , 100 , 100  ) 
CQMM0»/ftl2/  RHOWClOO  , iO  0, 10  0), Ev  10  0,10  0, 100), EKiOO,  10  0,100) 
COMMOM/fliS/  UClOO , 100 ,100) ,V< 100, 100, 100) , mio 0,100, 100) 
CDnMDN/fll4/  F<2,5) 

CDMM0N/R2/  PRDICTdOi,  5)  , P<101) 

Ca»MON/ft3/  VC100>,DYCELL<100),US1, JEl, JS2, JE2, JLFM, JL,vr, VH 
1 ,2<100  >,OZCELU<i00),KSi,KEi,KS2,KE2,KLFM,kU,2F,ZH 

CQMMUN/ft4/  JSHK, ILE,IE,XL,K1,K2,k3,K4,K5 

CDHMOW/ft5/  (sAMMft,  GflMMl,  6AMMFR  , CV  , CUl , STOKES  , UU  , CU  , FO  , RHOU  , RL  , XO 
COMMON/ft?/  DX, DXl, DY, DYl, 02,  021 , E I WALL , I fiOBWU  , DT , CFL , CONST 

DOMAIN  /EXPLCT/;  I =1 , iOO  ; Js  , iO  0 ; K^i  , 10  0 
OroXsOTADXl 

DOALL  U=JS1, JE2;k=KS1,KE25  USING  / All/ , /rtic/ , / Al3/ , /fll4/ , / a2/ , / A4/ 
DO  3 1*^1,  I L 
RRDICT(I,1)=RH0 
PRDICT< I, 2;aRHOUv I , J, K; 

PRDICTC I , 3)sRHOU< I , J, K ^ 

PftOICT<I,4>=RHOW<I,J,K/ 

PROtCTCI,5)=:E  <1,  J,K; 

P ( I ) j:(5AMM1  ARHOC  I,J,K;wEI<I,J,K^ 

3 CONTINUE 
00  4 Hi:i,2 

IJSI 

IADD:sN-l 
NMisN-i 
Bsi, /N 
i IsI+IADD 
UII=UCll,J,K> 

CALL  FX(UI  I,  I, J,K,  X J ) 

DO  5 Is2,  IE 
k3sK1 
Kx=K2 
K2  = K3 


Figure A.7 FMP FORTRAN Version of LX




103900  llsI+lflOD 

104000  U2 I=U< I I, J, 

104100  UllJsU<  I+X,  K) 

104E0O  UlScU<I,J,K> 

104300  IF<UIl,GT,Ul£,AND,<3,xUIi-UlE)A<3,3«cU:£-UIi),LT,0,>  UII  = ,5x(UIi  + Ul£ 

104400  X } 

104500  CALL  rx(ui I , I , J, K, * I ; 

10  4600  PRDICTC  I,  l)  = <;NMlJ«cPftDICT(  1 , 1>+RHD  < I , J,  K >-DT0XX<F<K£,  1 ^ -F  <Ki  , 1 ^ ) XB 

104700  PRDICT<I,  £>  = <NM1XPRDICT<I,£)+RHDU(I,  J,K/-DTDX>k<F(K£,£)-F<kl,£>>)j»cB 

10  4800  PRDICT<I,3>:s<NMiXPRDlCTCl,3)+RHDV<I,  J,K^-OTDX)«c<F<K£,3)-F(K1,3)>>AB 

10  4900  PFDICT<  I,  4>  = <NM1APR0ICTC  I,4>+RHDW(I,  J,K>-OTDXX(F(K2,4)-F(Kl,4)))>itB 

AO 50 00  PRD1CT<I,5>  = <NM1XPRDICT<I,5>+E  < I , J , K ; -OTOXX < F < K£ , 5 > -F < Ki , 5 > ) ) XB 

105100  5 CONTINUE 

105EOO  C 

105300  C X DECODE  x 

105400  DD  6 I~£, IE 

105500  RHOIsl, /PRDICTC I , i> 

105600  U < I,  J,K;:sf>ROICTi:  ;,£>XRHDI 

105700  U < I, J, K>=PROICT< I, 3>XRHOI 

105800  U <I,U,K;uPRDICT< I,4)XRH0I 

10  590  0 EI<t,  J,K;s:PRDlCT<  I,5)X  RHOI  - . 5x  < U < I , J , K > XX£  + V < I , J , K > XX£  + W < I 

106000  X ,J,K>XX2> 

106100  P(l>  a6RMHiXPRDXCT<  I , DXEI  < I ,U,  K> 

106200  6 CONTINUE 

106300  CXXX  XOOMNSTRERM  B,  C,  ft7  IsZL 
106400  DO  9 K6«l,5 

106500  9 PRDICT<  IL,  k6>::PRD1CT(  IE,  K6> 

106600  CALL  BCY(K, E, IE, J, J> 

106700  4 CONTINUE 

106800  DD  7 I=E,IL  ,.yY  QF  ^11^^ 

106900  RHO  (I,  J,K>:=PRDlCT<I,i;  tn 

107000  RHOU<  r , U,  K>=PROICT<  I , £)  ' ^ 

107100  RHDV< I , J, k;=PROICT( I , 3) 

) 107E00  RHDWd,  J,K^=PRDICT<I,4p 

107300  E < I , J, K^sPRDICT< I , 5> 

10740C  7 CONTINUE 

, 107600  ENOOD;  SlUINQ/All/, /AIE/, /A13/, /AE/ 

j 107700  CALL  OUTER( JSl, JEE, KS1,KEE) 

» 107800  RETURN 

107900  END 


Figure A.7 FMP FORTRAN Version of LX (Cont'd)
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.iO  0 X 0 0 c 
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101500 

101600 

101700 
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101900 

10£000 

loeioo 

10S200 

102300 

102400 

102500 

102600 

102700 

102800 

102900 

103000 

105100 

103200 

103300 

103400 


SUBRDUTINE  rx(UI 1 , I , J, K,  i I > 

X TRANSPORT  AND  ITRESS  IN  X-DIRECTION 

COHMDN/ail/  RHa(i0  0,l00,i00),RHDU< 10 0,100, 100), RHDV<1U0 , lOO , 10  0 > 
CDHMON/fllE/  RHOW (iUO, 100, 100),E( 100,100, 100), EK 100, 100, i00> 
CDMMUN/A13/  uCluO, 100, 100), 100,100, 10 0),W(100, 10 0,100) 
COMMDN/fti4/  F<£,5> 

CDHM0N/ft2/  PRDICT<lOi, 5) , P<101> 

CDHMDN/A3/  YClOO  ) , DVCELLdOO  ) , JSl,  J31,  J32,  JE2,  JLFM,  JL,  VF,  VH 
1 , ZCiOO ), 02CELL<iOO ) , KSi , KEl , K S£ , KE2 , KLFM,KU,ZF,2H 

CDf1MDN/ft4/  ISHK,  ILE,  IE,  lU,  Kl,  K2,  K3,  k4,  k5 

CDriMUN/A5/  6ftHMft,eAMMl,QAMMPR,CV,CVi, STOKES, UO , C0,P0,RHD0»  <L, XO 
CDHMaN/ft6/  RMUL(;i00,i00,i00) 

C0HM0N/fl61/  RMU, RK, RLMBDA 

COHMON/A?/  DX, DXi,0Y,DYl,02,  D21, EIWALt , lADBWL  , OT , CFL , CONST 
CDMMDN/A8/  I SMTHX , I SMTH Y , I SMTH2 , LYICNT,  LYCCNT,  LZCCNT,  LZICNT, 
1 NLYI , NL2I , BETA, BETAl, CRKNIS 

CDI1MDN/ANQL/  TANT(l01),CaST<101), T ANTH , TANTHB , COSTH , COSTS A , SECTH 
CDMMnN/VISCnU/SI6X,SXSY, SI62,TAUXY,TAUX2, TAUYZ, DISX,DISY, DIS2, 

X UYX,VYX,WYX 

RMU«RMUL( I I , J, K; 

RK  sQAMMPR:«tRMU 
RLMBDAsSTOKESJkRMU 
DYl-l.x  <Y( Jfx>-YC J-i) > 

D2i:2l,  /<Z<k  + l)-2(K-l)  ) 

DYX=.5:«c<TANT(  n+TANT<l  + l)  )AOYl 


Figure A.8 FMP FORTRAN Version of FX
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103500 
j.03600 
103700 
103800 
103900 
i0^000 
104100 
104S00 
104300 
104400 
104500 
104600 
104700 
104800 
104900 
105000  C 
105100  C 
105400 
105500 
105600 
105700 
105800 
105900 
106000 
106100 
<• 


UYX  = U<I  I,  + I,  J“1,K; 

MY>C  = V<  II,J*rl,K>-V(II,  J-i, 

SI6XSP  I I p -<RLMBDfl  + £,  ycRMU  <U<I+1,J,K;-UvI,J,KJ  )xDX1-UYXkDYX  } 

X ~RLMBOftX(MYXwOYl  + <W<  II,J,Kt1>-*W<II,J,K-1)>X021) 

TfiUXYs-RMUK < UYXXDY1+ ( V< I+l,J,k>-yCl,J,K;>XDXl-VYXXDYX; 

7flUXZs-RMUx( <U< I I , J, Kt1>-U< II , J, K-i) )XD2l+<W< I+l, J, K^-W< I , J, K>>X 
X 0X1-<W(  II  , Jrl,  K^-W<  II,  J-l,K'))xDYX; 

DISXsSISXXUI I+TftUXYXy( I I , J, K J7TRUXZAW< I i,j,k>-rkx<<ei<i  + i,j,k;-ei< 
XI, j,k;)adxi-Cei (I  I, J + 1,K^-EI<II, j-1,k;>adyx; 
r(K£,i>=PROICT<  II  ,1MUII 
F(K2,E)-PRDICT< II,£)AUI I+SI6X 
F(kE,3:»=:PRDICT(  II  ,3)AUII  + TftUXY 
F(K2, 4>«PROICT< I I , 4 > AU I I +TRUXZ 
F(K2, 5)aPRDICT< I I , 5) AUI I+DISX 

IF<  ISMTHX,  E(S,  0 .UR,  I,UE,i  .OR,  I.OE.IE)  RETURN 
X SMOOTHING  TERMS  X 

COEFcCONSTAABS<P< ll+l)-e. AP<II)+P(II-l>)/<ftBS<P<II+l)>+ 

A 2,XRBS<P<I1 ))+ABS(P(lI-l))> 

Cl I = SGRT<6flMMflA6RMMlAABS<EI (I  I , U, K) ) > 

CaEFcCaEFA<ABS(U< II,J,K/)+CII) 

DO  9 K6«l,5 

9 F<KE,K6)=F<KE,K6)-C0EFA<PRDICT<I+1, K6>-PROICT< I, K6>  > 

RETURN 

END 


Figure  A. 8 FMP  FORTRAN  Version  of  FX  (Cont’d) 
Total Execution Time = 5.32E08 us

Total Flops = 4.73E11

Throughput = 0.89 Gigaflops/Second


Table A.4

Throughput Computations for Explicit Code

PARAMETERS: 100 x 100 x 100 GRID SIZE, 100 TIME STEPS

ROUTINE     TOTAL TIME (us)   TOTAL FLOPS   THROUGHPUT (Gflops/sec)

LX          3.30E07           1.98E10       0.60
FX          5.75E07           3.60E10       0.63
LY          4.02E07           3.13E10       0.78
LYC         1.16E07           9.83E09       0.85
LYI         2.74E07           2.46E10       0.95
LZ          3.23E07           2.33E10       0.72
LZC         1.14E07           8.62E09       0.76
LZI         2.61E07           2.02E10       0.77
SQRT        1.47E07           2.08E10       1.41
CHARAC      1.07E08           1.45E11       1.36
DIAGON      3.54E07           5.00E10       1.41
TRIDATA     2.88E07           3.64E10       1.26
PRSETY      2.55E07           5.86E09       0.23
PRSETZ      2.55E07           5.86E09       0.23
GI          1.59E07           1.08E10       0.68
HI          1.47E07           6.60E09       0.45
ADDG        6.66E06           9.40E09       1.41
JCLMN       1.28E06           1.00E06       0
BCY         5.04E05           0             0
BCZ         5.04E05           0             0
OUTER       8.40E05           0             0
SBCINT      0                 0             -
TIMESTP     1.15E07           7.39E09       0.64
TURBDA      3.75E06           1.60E09       0.43
PRTFLW      -                 0             -

TOTAL       5.32E08           4.73E11       0.89
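
The throughput column of Table A.4 is the flop count divided by the execution time. The following is a minimal sketch of that arithmetic in Python, not part of the original analysis; the helper name `gflops` and the choice of rows are illustrative only.

```python
def gflops(total_flops: float, total_time_us: float) -> float:
    """Throughput in Gflops/sec from a flop count and a time in microseconds."""
    # flops / (time_us * 1e-6 s) / 1e9  ==  flops / time_us / 1e3
    return total_flops / total_time_us / 1.0e3


# (time in microseconds, flops) pairs copied from Table A.4.
rows = {
    "LX": (3.30e7, 1.98e10),
    "FX": (5.75e7, 3.60e10),
    "SQRT": (1.47e7, 2.08e10),
    "TOTAL": (5.32e8, 4.73e11),
}

for name, (time_us, flops) in rows.items():
    print(f"{name:6s} {gflops(flops, time_us):.2f} Gflops/sec")
```

Under this reading the TOTAL row gives about 0.89 Gflops/sec, which agrees with the last line of the table.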
A.4.4 Simulation and Hand Compiling


A validation of the above analysis was conducted by simulated
execution of a typical code section. The FMP simulator is
described in Chapter 7.

The main stream subroutines encompassing the bulk of execution
time were LX, LY, LZ, LYC, LZC, LYI, LZI and their associated
third level subroutines. The subroutine LX and its associated
level three subroutine FX were selected as representative of this
algorithm. LX and FX are shown in Figures A.7 and A.8
respectively. Figures A.9 and A.10 show the FMP FORTRAN versions of
TURBDA and OUTER, which were also simulated. No special handling
was needed on these subroutines. Each is a demonstration of a
simple conversion of nested DO loops to a DOALL construct.
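
The conversion of nested DO loops to a DOALL is valid when every iteration writes only its own array element, so the iterations may run in any order. The sketch below illustrates that property in Python rather than FMP FORTRAN; the loop body and the function names are hypothetical, and a shuffled serial order merely stands in for parallel execution.

```python
import random


def body(i, j, src, dst):
    # Each (i, j) instance reads src and writes only its own dst element.
    dst[i][j] = 2.0 * src[i][j] + 1.0


def nested_do(src):
    """Serial nested loops, as in the original FORTRAN."""
    n, m = len(src), len(src[0])
    dst = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            body(i, j, src, dst)
    return dst


def doall(src, seed=0):
    """The same instances executed in an arbitrary order."""
    n, m = len(src), len(src[0])
    dst = [[0.0] * m for _ in range(n)]
    instances = [(i, j) for i in range(n) for j in range(m)]
    random.Random(seed).shuffle(instances)
    for i, j in instances:
        body(i, j, src, dst)
    return dst


src = [[float(i + j) for j in range(4)] for i in range(3)]
print(nested_do(src) == doall(src, seed=1))
```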

The initial effort in preparing the simulator input was the
revision of the original FORTRAN code sections into the extended
FMP FORTRAN language. Modifications were primarily in the areas
of data declarations, domain declarations, and DOALLs. Assignment
to GLOBAL variables was assumed done in parallel across all
processors. In addition to these changes, the code was reviewed
for areas in which an optimizing compiler could be expected to
achieve time savings. These changes typically take the form of a
new local temporary variable holding the evaluation of a common
subexpression in order to improve performance.

In particular, common subscript expressions were detected and
evaluated separately during both the hand analysis and the
simulations. These expressions all involve the integer mode sums or
products used to compute an address from the subscript values.
Although a mature, optimizing compiler will find such common
subscript expressions and combine the results transparently to the
user, the hand analysis performed this level of optimization by
hand.

For example, in the subroutine FX, 25 three dimensional subscript
expressions may be reduced to seven common expressions. Other
changes, such as the use of an iterated DO loop rather than
straight line code, were made to reduce the size of the generated
machine code file.
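
The effect of combining common subscript expressions can be sketched as follows. This is an illustration in Python under stated assumptions, not the hand-compiled FMP code: the arrays are stored as flat lists in FORTRAN (column-major) order, and the names `offset`, `naive` and `hoisted` are hypothetical.

```python
def offset(i, j, k, ni, nj):
    # Column-major offset of element (i, j, k), with 1-based subscripts.
    return (i - 1) + ni * ((j - 1) + nj * (k - 1))


def naive(u, v, i, j, k, ni, nj):
    # The same integer address arithmetic is repeated for every reference.
    return (u[offset(i, j, k, ni, nj)]
            + v[offset(i, j, k, ni, nj)]
            - u[offset(i + 1, j, k, ni, nj)])


def hoisted(u, v, i, j, k, ni, nj):
    # The common subscript expression is evaluated once and reused.
    base = offset(i, j, k, ni, nj)
    # (i+1, j, k) is the next element in column-major order.
    return u[base] + v[base] - u[base + 1]


ni, nj, nk = 4, 3, 2
u = [float(n) for n in range(ni * nj * nk)]
v = [2.0 * n for n in range(ni * nj * nk)]
print(naive(u, v, 2, 3, 1, ni, nj), hoisted(u, v, 2, 3, 1, ni, nj))
```

Both functions return the same value; the second performs one address computation instead of three.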
      SUBROUTINE TURBDA(CV)
      COMMON/A11/RHO(100,100,100),RHOU(100,100,100),
     1 RHOV(100,100,100)
      COMMON/A12/ RHOW(100,100,100),E(100,100,100),EI(100,100,100)
      COMMON/A13/ U(100,100,100),V(100,100,100),W(100,100,100)
      COMMON/A14/ F(2,5)
      COMMON/A2/ PRDICT(101,5),P(101)
      COMMON/A3/ Y(100),DYCELL(100),JS1,JE1,JS2,JE2,JLFM,JL,YF,YH
     1 ,Z(100),DZCELL(100),KS1,KE1,KS2,KE2,KLFM,KL,ZF,ZH
      COMMON/A4/ ISHK,ILE,IE,IL,K1,K2,K3,K4,K5
      COMMON/A5/GAMMA,GAMM1,GAMMPR,CV,CV1,STOKES,U0,C0,P0,RHO0,RL,X0
      COMMON/A6/ RMUL(100,100,100)
      DOMAIN /EXPLCT/: I=1,100; J=1,100; K=1,100
      INALL/EXPLCT/ TEMP
      CV1 = 1.0/CV
      DOALL J=JS1,JE2; K=KS1,KE2; USING /A12/,/A5/
      DO 1 I=1,IL
      IF (K.EQ.1) TEMP=0.5*ABS(EI(I,J,1)+EI(I,J,2))*CV1
      ELSE IF(J.EQ.1) TEMP=0.5*ABS(EI(I,1,K)+EI(I,2,K))*CV1
      ELSE TEMP=ABS(EI(I,J,K))*CV1
      ENDIF
      RMUL(I,J,K) = 2.270E-08*SQRT(TEMP**3)/(TEMP+198.6)
    1 CONTINUE
      ENDDO; GIVING /A6/
      RETURN
      END


Figure A.9 FMP FORTRAN Version of TURBDA


      SUBROUTINE OUTER(JS,JE,KS,KE)
      COMMON/A11/ RHO(100,100,100),RHOU(100,100,100),RHOV(100,100,100)
      COMMON/A12/ RHOW(100,100,100),E(100,100,100),EI(100,100,100)
      COMMON/A13/ U(100,100,100),V(100,100,100),W(100,100,100)
      COMMON/A3/ Y(100),DYCELL(100),JS1,JE1,JS2,JE2,JLFM,JL,YF,YH
     1 ,Z(100),DZCELL(100),KS1,KE1,KS2,KE2,KLFM,KL,ZF,ZH
      COMMON/A4/ ISHK,ILE,IE,IL,K1,K2,K3,K4,K5
C     DOWNSTREAM AT I=IL
      DOALL K=KS,KE; J=JS,JE; USING /A11/,/A12/,/A3/,/A4/
      RHO(IL,J,K) = RHO(IE,J,K)
      RHOU(IL,J,K) = RHOU(IE,J,K)
      RHOV(IL,J,K) = RHOV(IE,J,K)
      RHOW(IL,J,K) = RHOW(IE,J,K)
      E(IL,J,K) = E(IE,J,K)
      ENDDO; GIVING /A11/,/A12/
      IF (JE.LT.JE2) GO TO 3
C     UPPER B.C. AT J=JL
      DOALL K=KS,KE; I=2,IE; USING /A11/,/A12/,/A3/,/A4/
      RHO(I,JL,K) = RHO(I,JE2,K)
      RHOU(I,JL,K) = RHOU(I,JE2,K)
      RHOV(I,JL,K) = RHOV(I,JE2,K)
      RHOW(I,JL,K) = RHOW(I,JE2,K)
      E(I,JL,K) = E(I,JE2,K)
      ENDDO; GIVING /A11/,/A12/
    3 IF (KE.GE.KE2) THEN
C     EDGE B.C. AT K=KL
      DOALL J=JS,JE; I=2,IE; USING /A11/,/A12/,/A3/,/A4/
      RHO(I,J,KL) = RHO(I,J,KE2)
      RHOU(I,J,KL) = RHOU(I,J,KE2)
      RHOV(I,J,KL) = RHOV(I,J,KE2)
      RHOW(I,J,KL) = RHOW(I,J,KE2)
      E(I,J,KL) = E(I,J,KE2)
      ENDDO; GIVING /A11/,/A12/
      ENDIF
      RETURN
      END


Figure A.10 FMP FORTRAN Version of OUTER
The next phase was hand compiling. It was assumed that the first
four registers in all groups were designated scratch registers.
They were also used for passing parameters and results to and from
subroutines. The remaining registers were employed for longer
lifetime storage requirements.

Experience during the hand compilation demonstrated that the need
for integer registers exceeded the supply. As a result, storing
and restoring of these registers had to be employed.

In subroutines SBCINT and JCLMN a non-standard approach was
assumed. Both routines have a very minor impact on total
throughput. The SBCINT routine performs a clearing operation on the
three dimensional array SBC and is called four times. SBC was
declared to be an INALL array. A single statement, SBC=0,
therefore clears it. The routine JCLMN is called from the LYC and LZC
subroutines outside of their DOALLs. Although the routine could
be programmed using recurrence, there seems to be no advantage.
This routine is, therefore, assumed to be executed serially.

A.5 GISS CLIMATE PERFORMANCE EVALUATION


A.5.1 Summary


The evaluation described below was done on an intermediate size
(2° latitude steps, 2.5° longitude increments along the equator)
weather program. The program consists of an easily vectorizable
fluid dynamics section (subroutines COMP1 and COMP2 and the
subroutines they in turn call), and a hard-to-vectorize physics and
chemistry section (COMP3 and its subroutines). The average
throughput for the entire program was determined to be 0.532
Gflops/sec. The time for a 14-day simulation with 20 minute time
steps was projected to be 4 minutes, 25 seconds.


The GISS weather program demonstrated the advantages of the FMP
architecture over that of a vector machine. The vectorizable portions
of the program tended to run slowly because of many EM accesses, but
the unvectorized portion of the program, namely COMP3 and its
subroutines, ran at 1.2 Gflops/sec for the portion simulated.


A.5.2 Discussion of the Analysis

The following versions of the weather model codes were provided by
NASA as input for selecting an FMP benchmark test.
GISS Models


A. 360/65 version

B. 360/195 version

C. STAR 100 version

D. ILLIAC IV version

The various versions are machine dependent versions of the
Mintz-Arakawa differencing scheme, which numerically solves the
differential equations representing the physical dynamics of weather
conditions. Reference [2] describes this methodology.

The basic database for the GISS model is a series of three
dimensional arrays. The data values in individual arrays represent
temperature, pressure, humidity, etc. at each point of the assumed
latitude, longitude and altitude grid. Arrays of one and two
dimensions are also utilized in various code sections in addition
to various simple scalar values.

Minor variations in GISS versions exist due to selecting different
granularities in grid size, time step, split grid, step size, I/O
management and the nature of the Host machine architecture
(Scalar/360, Vector/STAR, Array/ILLIAC) considerations. Grid
sizes vary from a coarse (25, 40, 2) to a superfine (180, 288, 9)
as indicated in Table 2.1 of reference [1]. The historical
increase in computing power has provided the facilities for
including the larger grid sizes and smaller time steps and thereby
improving the accuracy of results.

A medium sized grid of (89, 144, 9) was selected for FMP Benchmark
purposes. This size is a valid test of the system's dexterity,
although a larger size would probably enable higher system
efficiencies and simple program conversion to the 512 processor FMP
system. The 360/195 non-split GISS version was used as the basic
FMP benchmark model.

Simulation of code running on the FMP system is necessarily
limited by time and cost. These requirements necessitate the
separation of the GISS model into low and high use frequency classes in
order to expedite the analysis. Routines of low frequency
(once/run) and therefore considered of null impact were:

INPUT

GMP

SDET
Routines of high frequency and therefore maximum impact were:

COMP1 - AVRX
COMP2 - AVRX
COMP3 - OZONE
      - SOLAR
      - LINKHO - SQRT
               - EXP

AVRX is an extremely frequently used subroutine and presents an
interesting opportunity for optimizing FMP performance. The
function of AVRX is as follows. First, for every latitude J,
compute a number NM(J) (also called DRAT and FNM). Then, for each
point J,I (I is the longitude index) perform a smoothing function


New PU(J,I) = S(Old PU(J,I-1), Old PU(J,I), Old PU(J,I+1))

over all values of I. Then update Old PU = New PU at all values
of J,I. For any given latitude J, do the smoothing NM(J) times.
NM(J) is a non-decreasing function of distance away from the
equator, although this fact is not used in the original program.
Several methods of converting this subroutine into FMP FORTRAN are
discussed below.
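
The serial behavior of AVRX described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the GISS code: the report only names the smoothing function S, so the three-point stencil, its weight `alpha`, and the wrap-around in longitude are assumptions, as are the function names.

```python
def smooth_row(row, alpha):
    # One pass of an assumed three-point smoother with wrap-around in
    # longitude. Every new value is built from old values only.
    n = len(row)
    return [row[i] + alpha * (row[i - 1] - 2.0 * row[i] + row[(i + 1) % n])
            for i in range(n)]


def avrx(pu, nm, alpha=0.125):
    # pu[j][i]: field at latitude j and longitude i.
    # nm[j]: number of smoothing passes applied at latitude j.
    out = []
    for j, row in enumerate(pu):
        for _ in range(nm[j]):
            row = smooth_row(row, alpha)   # Old PU = New PU after each pass
        out.append(list(row))
    return out


pu = [[1.0, 2.0, 3.0, 4.0],
      [5.0, 5.0, 5.0, 5.0],
      [0.0, 8.0, 0.0, 8.0]]
print(avrx(pu, nm=[0, 3, 2]))
```

Latitudes with a larger NM(J) do more work, which is the source of the load imbalance that the methods below try to reduce.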

1. A DOALL on J, with the programming over I and N serial inside.
   89 out of 512 processors have instances, and the longest
   instances occur at the poles where NM has its maximum value.

2. An outer loop on N, iterating the number of times given by the
   maximum value of NM at the poles. Inside, a DOALL over both J
   and I allows all processors to execute on the first iteration
   of the N loop, but as the successive iterations of the N loop
   occur, those instances which test and find that N.GT.NM(J)
   exit without performing any work. At the end of the N loop,
   only those instances which lie at the poles are doing any
   work; the others are idle.

3. Like 2, except that when computing NM(J), the smallest J is
   computed (nearest the equator) for which the given value of NM
   occurred, giving JL(NM) as a GLOBAL array. The program
   structure would look like:

       DOALL J=1,JMAX
         NM(J) = arithmetic expression
         JL1(NM) = smallest J in northern hemisphere for which NM has
                   the value shown in the subscript
         JL2(NM) = largest value of J in southern hemisphere for which
                   NM has the value shown in the subscript
       ENDDO

       DO 1 N=1,NM(poles)             % N loop
         DOALL J=1,JL2(N); I=1,IMAX   % All points needing smoothing
                                      % in the Southern hemisphere
           PU = S(Old PU values)
         ENDDO
     1 CONTINUE

   Method 3 avoids the creation of instances that do no work, and
   hence enhances processor utilization. Even though the last few
   iterations on N have only 144 instances, since JL1 will equal JMAX
   for large N, and JL2 will equal 1 for large N, the average number
   of processors busy would be substantially better than that for
   either method 1 or method 2. The cost is increased overhead at
   the beginning of the DOALLs.

4. Method 2 can be modified as follows. First, the DOALL on I
   and J can be replaced by DOALL J=1,89; II=1,109,36. Inside
   the DOALL, a loop, DO M=1,36, is added and the subscript I is
   set equal to II+M. The result is that 36 neighboring values
   of I are computed within a single processor, and the same old
   value of PU can be fetched once from EM for all three uses
   within the smoothing function. The result is a decrease in
   the number of required EM accesses by almost a factor of
   three, while processor utilization is reasonably good (356
   processors out of 512, for the particular example).

5. With even more complexity in the management of the mapping
   between domain variables and I,J, one can have 36 values
   of I per instance, and keep 494 processors busy.
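
The processor counts quoted for methods 1 and 4 follow directly from the loop bounds. The sketch below reproduces them in Python; the variable names are illustrative, and the block offset `m` is taken to run from 0 to 35 so that the four blocks tile longitudes 1 through 144 exactly (an assumption, since the text sets I equal to II+M).

```python
JMAX, IMAX, NPROC = 89, 144, 512

# Method 1: one instance per latitude.
method1_instances = JMAX

# Method 4: DOALL J=1,89; II=1,109,36, with 36 longitudes per instance.
ii_values = list(range(1, 110, 36))            # 1, 37, 73, 109
method4_instances = JMAX * len(ii_values)

# Each instance covers 36 consecutive longitudes starting at II.
covered = sorted(ii + m for ii in ii_values for m in range(36))

print(method1_instances, "of", NPROC, "processors busy in method 1")
print(method4_instances, "of", NPROC, "processors busy in method 4")
print("longitudes covered:", covered[0], "to", covered[-1])
```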

Time precluded simulating any more than one of the above options.
Option 5 was selected for simulation, and produced the result
shown in the table. One of the reasons for selecting option 5 is
that the remapping of the values of I,J into particular processors
might be done, not explicitly by the programmer as shown in
example 4, but by having the compiler map particular instances
into particular processors. In the prototype compiler, it is
expected that the assignment of instance number to processor will
be fixed at processor number equal to instance number modulo 512.
Future compiler enhancements could include statements that allow
the programmer to specify how instance numbers map onto
processors. The simulation, to some extent, was an investigation
of the value of such mappings.
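
A manual mapping of this kind can be sketched as follows. The index arithmetic is an assumption based on a reading of the AVRX listing in Figure A.11, which is partly illegible in the scan: each processor PNO executes 26 cycles, cycle M handles instance number 512*(M-1)+PNO, and the instance number is split into a longitude and a latitude index. The Python names are illustrative.

```python
JMAX, IMAX, NPROC, NCYC = 89, 144, 512, 26


def grid_point(pno, m):
    """Grid point handled by processor pno in cycle m (1-based), or None."""
    inno = NPROC * (m - 1) + pno       # instance number
    i = inno // JMAX + 1               # longitude index
    j = inno % JMAX + 1                # latitude index
    return (i, j) if i <= IMAX else None   # None: instance is off the grid


points = [grid_point(p, m) for m in range(1, NCYC + 1) for p in range(NPROC)]
valid = [pt for pt in points if pt is not None]

print("grid points covered:", len(valid))
print("average processors busy per cycle:", len(valid) / NCYC)
```

Under these assumptions every one of the 89 x 144 grid points is assigned exactly once, and an average of roughly 493 processors is busy per cycle.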

After AVRX, the bodies of COMP1 and COMP2 are the next most
frequently used. With minor exceptions, they have common coding
characteristics. They were:

- Heavy use of Extended Memory
- Heavy use of three dimensional indexing
- Low number of floating point operations/access
The initial section of COMP2 was judged to be typical and was
therefore simulated on the instruction timing simulator.

COMP3 is executed once for every NCOMP3 executions of COMP1 and
COMP2. The radiation routines, LINKHO, etc., are called once for
every NHOGAN times that COMP3 is called. Values of three and five
respectively were used for NCOMP3 and NHOGAN. In COMP3 and its
subroutines, computations are carried on along the vertical
direction, making each latitude-longitude point independent of any
other. Thus COMP3 partitions into a set of independent instances,
each having a specific location on the earth's surface. COMP3 and
its subroutines are characterized by:

- Minimum use of Extended Memory

- Simple parallel partitioning

- High number of floating point operations

- Low number of indexing operations

- Data dependent branching
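
The calling frequencies implied by NCOMP3 and NHOGAN can be tallied with a short Python sketch. It assumes one execution of COMP1 and COMP2 per 20-minute time step and that COMP3 and the radiation routines fire on exact multiples of their counters; neither detail is stated in the report, and the function name is hypothetical.

```python
NCOMP3, NHOGAN = 3, 5


def call_counts(n_steps):
    """Return (COMP3 calls, radiation calls) over n_steps time steps."""
    comp3 = radiation = 0
    for step in range(1, n_steps + 1):
        # COMP1 and COMP2 are assumed to run on every step.
        if step % NCOMP3 == 0:
            comp3 += 1
            if comp3 % NHOGAN == 0:
                radiation += 1
    return comp3, radiation


steps = 14 * 24 * 60 // 20      # 14-day run with 20-minute time steps
print(steps, call_counts(steps))
```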

The two maximum frequency inner loops of the LINKHO subroutine
were judged typical of this code section and simulated in detail.

The routines actually simulated during this analysis are
summarized in Table A.5. They are:

* LINKHO (portions)

* COMP2 (portion)

* AVRX

A.5.3 FMP FORTRAN Version

Figures A.11, A.12 and A.13 respectively show the FMP FORTRAN
versions of AVRX and the portions of LINKHO and COMP2 simulated.
Note that AVRX and COMP2 make substantial use of DOALL constructs.
LINKHO does not demonstrate any DOALL constructs since it is
called within each instance of sections of COMP3. LINKHO is an
exceptionally good example of the data and instance-dependent
computation in COMP3 which would execute efficiently on the system
evaluated even though it would be difficult to vectorize. The
aerodynamic flow codes analyzed did not exhibit the independence
between instances to this degree. Substantial use is made of
parts of the language that see little or no use in the two aero
flow codes, including:

* Domain definitions constructed using domain expressions that
  include previously defined domains. (See AVRX for example)

* INALL declarations (See AVRX for example)

Figure A.14 shows the branching structure of the subroutines.
Note the presence of A**B, which is a form of call on the EXP
function.


Table A.5


GISS WEATHER MODEL
BENCHMARK SIMULATION RESULTS


                                                           Routine
Measure                                             AVRX     COMP2    LINKHO

Total no. of CU simulated instructions                48        32        30

Total no. of EU simulated instructions              3800      3094      2529

Total no. of EU machine clocks consumed            25318     23417     16705

Total no. of floating point register
related instructions                                 338       900      1058

Total no. of floating point arithmetic
operations                                           134       688      1266

Total no. of machine clocks for F.P.
arithmetic operations                               1039      7052     11624

Total no. of integer/logical instructions           2800      1449       425

Total no. of control type commands                   662       745      1056

Average execution time for all
instructions (NS)                                  266.5     302.7     264.2

Average execution time for floating
point operations (NS)                              310.2     410.0     367.3

Average total elapsed time for floating
point operations (NS)                             7557.6    1361.5     527.8
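
The three averages at the bottom of Table A.5 appear to be derived from the clock and instruction counts above them. The Python sketch below reproduces them under the assumption of a 40 ns machine clock; that period is not stated in this section and is inferred only because it matches the tabulated values to within rounding.

```python
CLOCK_NS = 40.0   # assumed clock period; inferred, not stated in this section

# routine: (EU instructions, EU clocks, F.P. operations, F.P. clocks)
table = {
    "AVRX":   (3800, 25318, 134, 1039),
    "COMP2":  (3094, 23417, 688, 7052),
    "LINKHO": (2529, 16705, 1266, 11624),
}


def derived(instr, clocks, fp_ops, fp_clocks):
    avg_instr_ns = clocks * CLOCK_NS / instr       # time per instruction
    avg_fp_ns = fp_clocks * CLOCK_NS / fp_ops      # time per F.P. operation
    elapsed_per_fp_ns = clocks * CLOCK_NS / fp_ops # elapsed time per F.P. operation
    return avg_instr_ns, avg_fp_ns, elapsed_per_fp_ns


for name, counts in table.items():
    print(name, ["%.1f" % value for value in derived(*counts)])
```

The last figure is the most useful one for throughput estimates: it charges all integer, control and memory overhead to the floating point operations actually performed.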
C     NOTE THAT THE CODE DEVELOPED BELOW IS MANUALLY MAPPED
C     TO THE HARDWARE BY STRUCTURING THE CODE.
C     THE DOMAIN DEFINITIONS ARE USED TO ALLOCATE
C     WORK TO PROCESSORS AND TO CYCLES (INSTANCES)
C     WITHIN EACH PROCESSOR.
C
C
      SUBROUTINE AVRX
      STRUCTURE COMMON .... HERE ARE ALL ARRAYS IN BLANK COMMON...
      GLOBAL COMMON HERE ARE SIMPLE VARIABLES IN BLANK COMMON..
     X ,ALPH(16),DRAT(16)
      GLOBAL COMMON /WORK/ PU(89,144)
      DOMAIN /PROC/: PNO=0,511
      DOMAIN /CYC/: INST=1,26
      DOMAIN/AVRXD/: /PROC/ .X. /CYC/
      INALL/PROC/ TPU(28),TTPU(28),II(28),JJ(28),EIM1,EI,EIP1
      STRUCTURE LOGICAL DONE(JM) .FALSE.
C     CALCULATE DRAT(J), ALPH(J), AND NLIM(J), ONE VALUE PER LATITUDE
      DOALL J=2,JMAX-1
      TDRAT = DYP(2)/DXP(J)
      ALPH(J) = 0.125*(TDRAT-1)/FLOAT(FIX(TDRAT))
      DRAT(J) = TDRAT
      NLIM(J) = FIX(TDRAT)
      ENDDO
C     LOAD TPU WITH PU
      DOALL /PROC(PNO)/
      DO 1 M=1,26
C     NOTE THE INSTANCE NUMBER WHICH IS COMPUTED HERE
      INNO = 512*(M-1)+PNO
      II(M) = INNO/JMAX + 1
      JJ(M) = MOD(INNO,JMAX) + 1
      IF (II(M) .GT. IMAX) EXIT
      TPU(M) = PU(JJ(M),II(M))
    1 CONTINUE
      ENDDO/PROC/
C     START A DO WHILE
  100 CONTINUE


Figure A.11 FMP FORTRAN Version of AVRX
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10E900 

OOftLU  /PRDC<PND)/ 

103000 

EIMl  s TPU<1) 

103100 

El  = TPU<2> 

103E00 

DD  /CYC(INST)/ 

i03300 

I a IKINST) 

103400 

J a JJ^INST) 

103500 

IF  <I  ,ST.  IMflX;  EXIT 

103600 

IF  < (DRRTv J> , LT, 1)  .OR.  (N, 

103700 

DDNEC J)  a , TRUE. 

103800 

GO  TO  2 

103900 

ENOIF 

104000 

EIPl  a TPU<INST+2> 

104100 

XF<I.EG,0>  EIMiaPUi' J,  IMflX; 

104200 

IF  < 2 . EG,  IMflX;EIPl  = PUC J, 1 J 

i04300 

TTPUONST)  a El  + flLFH<I)x 

104400  C 

STORE  CftSES 

104500 

104600 

104700 

104800 

104900 

105000 

105100 


ir  ( < INST,  EG,  j.;  ,DR,  ; INST,  EG,  ii6)  ) PU<J,I>  s TTPU(INST> 
IF(I,EG,l;  PU1.J,  ;MftXPl)  = TTPU(  INST^ 
iF<  I , EG,  IMrtX  J PUC  J,  0 ^:sTTPU<  INST) 

EIMI  = El 
EI=  EIPl 
ENDOD  /CYC/ 

SYNCH  POINT 


INST)  :=  PU  < J J i,  I NST  ) , I I ^ INST  ) > 


EG,0>,OR.kI.EG.IMflXPl))TFUvIHST>spu<JJt,INST),II<INST)) 


105200 

NEXTDD 

105300 

DO  INST  a £ 

105400 

TPU(INST)  : 

105500 

ENDDQ 

105600 

DO  INST=1,2i 

105700 

TPUvINST)  : 

105800 

ENOOa 

105900 

DD  INST  a 

106000 

:f(<i.eg 

106100 

ENOOQ 

106200 

NaN-rl 

106300 

DD  Mal,£ 

106400 

TPU( 26+m 

106500 

ENDDD 

106600 

ENDDD/PRDC 

106700 

DDRLL  Jai., 

106300 

IF  (RUUkDI 

106850 

ENDOD 

106900 

GO  TO  lOO 

Figure A.11 FMP FORTRAN Version of AVRX (Cont'd)


iOOOOO  J5UBRDU7INE  LINKHD 

iOOiOO  CartMDN  /RRDCOM/PH:9>,FLECl0),FLKi:9>,Te,TS,  TLt9>,  TSTRC3) 

i00  20  0 i ,3HLt,9>,  CLOUD  < > , RE(iU  ),  RESTR<3)  , FLXDNS,S6,ftS<9^,RSSTR<3) 

100300  2 SC,CDS2,RSURF,SCQSZ,RflP,RftM 

100400  COMMON  /CLDCDM/  SWRLE v 16 > , SW IL < l5 > , ftU.  16 > , TRUL ^ l6 > , OZRLE < 16 ) , 

100500  1 TDPRBS 

lOOeOO  LOGICAL  CLOFLG, AERFLG, LI, LL 

100  70  0 RERL  TAUCIR, C TfiU55 , it  , P 1 0 , TN , RERl , RER2 , RERft,flERC, RERU, PERM, 

100800  1 EXl, EXE, OENU , DNMO , DNM I , RER V , EXTAU , TflU , RDNCN , EDNCN , TOFCN , 

100900  2 EUPCN, EDNCN 

101000  INTEGER  NCLDUD<1E> , HAERD<1£> 

101100  REAL  CIREXT(1£>,TAUN(1£,3),P1C1RC1£>,PI2<1E,1E>,CB<1E,1E;, 

lOlEOO  1 BTDP<14>,TDF<iE>,REF(lE), EUP ( l£ > , EON ( lE ) , TE3<301) , EUPC(l£) , 

101500  E EOWCCic), TDFC<1£> ,RDMC<1E> 

C
C     ADDITIONAL DECLARATIONS NOT USED IN THE SIMULATED PORTION
C     ARE OMITTED FOR BREVITY
C
C     STATEMENTS ABOUT PARALLELISM ARE OMITTED ALSO SINCE LINKHO
C     IS CALLED AS A SUBROUTINE WITHIN THE INSTANCES OF THE
C     DOALL /LAYERS/ OF COMP3. IN THIS CASE, EACH INSTANCE
C     CALLS LINKHO INDEPENDENT FROM ALL OTHER INSTANCES AND
C     USES A LOCAL COPY OF CODE WITHIN THE PROCESSOR IN WHICH
C     THE INSTANCE RESIDES. SEQUENCING OF THE EXECUTION WITHIN
C     THIS SUBROUTINE IS SOLELY DEPENDENT ON THE INSTANCE AND
C     LOCAL DATA, NOT ON ANY OTHER INSTANCES.
C

102700  DO  200  LAM  :s  j.,x2 

102800  DO  100  K s 1,3 

102900  DO  101  N » 1,NLAYRS 

103000  NCC  :=  NCLDUDCNj 

103100  .:AER  = NAERO(N> 

103200  TAUCIR  S CIREXT<LAM;  X CTAU55  * NCC 

103300  X =;  TAUN<N,K>  + TAUCIR 

103400  TAUN(N,K;  = X 

103500  PIO  =(TAUCIRAPICIRD<LAM j + P I 2 ( L AM , N ) > / < X+l • E-^U > 

103600  IF(N,GE,4.J  THEN 

103700  TN  s tl(n-5>/273, 

103800  ELSE 

103900  TN  s tSTR(:n  )/273, 

104000  ENDIF 

104100  IF  < TN. QE. 0 , 85548  .AND,  NCC , GT , 0 J P I QsU . 

104200  IF<P ID. GT, 1. E-4;  then 

104300  AERi  si,-  PIO 

104400  AER£  s 1,  - vP10ACB<LAM,N^ 

104500  AERA  s SQRT< AER1/AER2) 
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flERU  « <1,  - ftERfl)/2, 

RERV  s <i,  -r  ftERR)/2, 

RERC  s SQRT<3,y£ftERlARER2) 

Xi  K -(fiERCAX; 

EXi  s 0,U 

ZF  <Xi  ,GE,  -130.218)  EXl  = EXPCXi) 

IF  CEXi.LT,i,UE-30 ) EXi=0,0 

EX2  =:  EXlAEXi 

DENO  1, /(  (RERVaRERM)  - (RERUARERUAEX2) ) 

DN«0  s;  i<BTOP<N;  - BTOP  < N + 1 ) / v X ARERC  > ) A 
(<RERM  - RERUAEX2)  - (RERRAEXi)) 

DHrti  s RERM  + RERUAEX2 

EUP(N>  = <BTOP<N>AONMl  - DNMO  - BTQP < N+1 ) AEXl ) A 
OEHO aRERR 

EDUvN;  s (BTaP<W+i )ADNM1  + DNMO  - BTnP<N>AEXl)A 
DENU ARERR 

ref<n;  = RERUARERVAC 1, -EX2)ADEN0 
TOFtN)  CflERV-RERU  JaDENOAEXI 
ELSE  IF  (NCC.GT.U^  THEN 
7DF<N)  a 0,0 
REF^N)  s 0.0 
EUP<N^  = BTDP<N> 

EDN(N>  a BTDP(N+i) 

ELSE  IF  < X.LT.1.E-4J  THEN 

TDF<N/SX. g 

rcup^'N/  s u.*o  K^PHOi)UCIBILITlt’'  OF  THfi 

EoruN^  =0.0  ‘'aJGTKAi,  PAGE  IS  POOR 

ELSE 

IF  vX  .LE,  j.5.0;  THEN 
EXTRU  = EXP<-Xj 
I T V = ; ( A 2 0 , T 1 , 

TDF(,N>  = TE3<ITY>  + ^TY-ITY  + l)  A i TE3  < 1 T Y + 1 ) -TE3  < I T Y ) ) 
ELSE 

EXTRU  = O.g 
TOF(N>  = 0.0 
END  IF 

REF<.W^  - 0.0 
XI  = 1,0  - TDFO-O 

:C2  = \<i,0  - EXTRU  „i/X-TDF<H  J > a <<BTQP(N)  - BTOP(Nt1);a 

0,6666) 

EDNifN;  « BTDP(Ntl)  AXl  + XC 
EUP<Nj  a BTOP(>nAXl-x2 
END  IF 

DENO  = - RDNCNaREF* N j ) 

EONCN  = V EDNCNtEUP<N  J aRDNCN  I a TDF<N  j A DENO  -r  EDN(N> 
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X09000 

IF  <NCC,6T,0)  CL0FL6  « .TRUE. 

i09i00 

IFCCLDFLS. RND. PID, QE, 1. UE-4>  ftERFLGc , TRUE , 

109200 

IF  ^ , 1407,  vCLOFUS,  DR.  fiERFLG)  ) THEN 

109300 

TftU  5=  TftU  -r  X 

i09»l00 

IF  (TRU  ,GT.  15>  THEN 

109500 

TOFCN  = 0. 

109600 

ELSE  IF  < <20 , ATflU+i, / , L7, i;  THEN 

109700 

: TY=i 

109300 

TOFCN  £i  TE3< ITY  ) + <TY-ITY  + 1>A<TE3( ITY  + 1;-TE3< ITY J ) 

109900 

END  IF 

llOOOO 

ENDIF 

llOlOO 

IF  <RERFLG>  THEN 

110200 

RDNCN  a REF(N/  TDF  < N ^ TOF  < N ^ ARDNCNRDENO 

110300 

TOFCN  = TOFCNATOF(N^3«(OENO 

llOHOO 

ENDIF 

110500 

IF<NCC,I4E.0  .OR,  F ID,  LT.  1,  OE-4)  THEN 

110600 

TOFCN  r:  U.U 

110700 

RONCN  s 0,0 

110300 

TflU  s 0,0 

110900 

ENDIF 

111000 

EUPC^N^  » EUPCN 

illlOO 

EDNC<Nj  s EDNCN 

111200 

TDFCCNii  :s  TOFCN 

111300 

RDHC<N>  s RONCN 

111350 

101 

CONTINUE 

111400 

DD  118  M s NG-1,1,-1 

111500 

DEHD  = 1, 0/^  1. 0-RUPCNXREF<M > ) 

111600 

EUPCN  s EUP<M;  ♦ < <EDN^M>^RUPCN  + EUPCN  ) x T OF  CM  ^ >«DENU 

111700 

IF  <M.NE,i;  THEN 

111300 

RUPCN  a REF(Mj  ^ <TDF  <M  )ATOF<MMRUPCNxDEN0  ) 

111900 

L s M-x 

112000 

DENU  s 1, / <1, -RDNC< L;xRUPCN ^ 

112100 

PEFUP  ti  (.EUPCN  T EDNC<L;  ARUPCN;xDENU 

112200 

PEFON  s (EUPCN  + EONC<L;aRONC<L; )AOEN0 

112300 

ELSE 

112400 

PEFUP  = EUPCN 

112500 

PEFON  ::  0,0 

112600 

ENDIF 

112700 

FE(M>  » FE(M>  + vPEFUP-PEFDN;ACLKRM 

112300 

116 

CONTINUE 

112900 

100 

CONTINUE 

113000 

200 

CONTINUE 


C     THIS IS THE SECTION OF GISS COMP2 THAT WAS SIMULATED
C
C     CORIOLIS FORCE


DOALL  Jiic,  JHMi;  Isl,  IN 
IF  ( : . ECt.  1^  THEN 
nils  I M 
ELSE 
I Ml  ; 

END  IF 

rD<i, ; > u 0.0 
rO<  JM,  I ; U , U 

DO  100  L«1,NLH|' 

HERE  THE  COMMON  SUBSCRIPT  EXPRESSIONS  ARE  NOT  GIVEN 

BUT  THE  COMPILER  IS  ASSUMED  TO  HAVE  EXTRACTED  THEM  APPROPRIATELY, 


~ FtJji  r DXVP(JJ  ▼ U,25A<U(JjIjL/TU^J,*‘'*»LvTUvJ*rljIjL/' 

f UvJtI, i“l}L//x(OXUvU/~DXU\Jf*^) 


COr^TXNUE 

ENUOQ 

DOALL  J:-2«JMiI::I,IM 
IF  *;i,EO,l>  THEN 


iH*  w IM 
ELSE 
I Mi  X 
END  I 


DO  200 

rxco 

ALPH 
U V < J 
UT  ^ J,  * M* , L ^ 
U T ( J , * , L ^ ~ 
J,  IMit  L^ 


L u 1,NLAT 
0,i25wDJ 

(P<  J 




a FXCO 
I j U “ 


j I >tP*.  J — *«  I * ) Apfc  vU* 

U*i'\JtI«L^  T ALPH«M^,J,i^L^ 
u UV\J,IM*,L^  T AUFHkV(U,IM*,Lj' 
U7^J,;,LJ  - ACPHkUv.  J,  . * L i 

ir  MVv*,iMi,L^  - m UP  HxU  v J , I Mi  . L / 


> i ^ > 


CONTINUE 

ENDOD 

C     VERTICAL ADVECTION OF THERMODYNAMIC ENERGY

DonuL  J^4.«  jm;  iH 

DO  300  Ltix  , NLH'i  Mj. 

UFA  :**  A <;  J , i ,1 
LiJIGA  « 51G(L  ^ 

L31iG&  » 

FKX  - E;<FeVKt,FTRDP  t tSIGflJ^LFA) 

PK2  EliFEYK  ( FTROF  f LSIGBAUFft  J 
LDSXGA  K 0SX€(L; 

UDSXGB  u OBXGvLrX) 

L7fl  ti  7vJf4yU^ 

L70  ti  T\J»*jLTi/ 

LSDA  » 50(3,  U > 

LFXTft  a P X T( J,  X ; 

CUl  a UDSIGB/ i UDSIGft  + UDSIGB  ) 
cue  a 4.  , U **  CUl 

7ETAM  a CO  iAL7ft/PKX  r COC.HLTB/pKa 

L7TR  « TT(J,;,u;  y 07  A < LS  I GAAKHPARLF-ft^LTRALP  I T A/PUl- 
1 U5DAATETAMAPKX/LOSI6ft > 

L7TB  aTT '.  J,  K,  Lt  j.  J t D7  A LSDft  A T E Tft  M AP  KE /UDS  I GR 
:F  (UP! , eg, HLhV J L7TBaL77B  + D 7 AL£ I 6BAKRPR ALPAALTB A 
1 LPITR/PUL 

77(J,X,L>  a L77R 
TTvJfXjLrJi/  a L7TE 
      CONTINUE
      ENDDO

COMP2 CONTINUES BEYOND HERE. THIS IS THE END OF THE PIECE SIMULATED


A.5.4  Results


Table A.6 shows some of the assumptions and summarizes the results
of the analysis. Table A.7 shows a more detailed breakdown of the
analysis by subroutine.

This is a worst-case analysis, in that the data-dependent branches
were assumed to demand the most computation. This was done in
order to estimate the maximum running time of the GISS weather
code. Weather conditions that shorten individual computations,
such as clouds that reduce the amount of radiation computation
needed, will result in a faster total run for the whole program.
They also result in fewer computations actually performed.

An interesting detail of the analysis concerns the assignment of
instances to processors. In the prototype compiler, instance
number is computed as described in the section on FMP FORTRAN, and
instance number i is processed in processor number (i mod 512).
That is, at the beginning of the DOALL, processors 0 through 511
are given instance numbers 0 through 511 to do, and then each
processor increments instance number by 512 to find its next
instance to do, until all instances in the DOALL are exhausted.
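The cyclic assignment described above can be sketched in a few lines of Python (used here only for illustration; it is not FMP FORTRAN). The function names and the example instance count of 89 x 144 are assumptions made for this sketch and are not taken from the prototype compiler.

```python
NUM_PROCESSORS = 512  # processor count stated in the text


def processor_for_instance(i, num_processors=NUM_PROCESSORS):
    """Processor that executes DOALL instance number i (cyclic assignment)."""
    return i % num_processors


def instances_for_processor(p, total_instances, num_processors=NUM_PROCESSORS):
    """Instance numbers handled by processor p, in the order it executes them.

    The processor starts at instance p and steps by the processor count
    until the DOALL is exhausted.
    """
    return list(range(p, total_instances, num_processors))


# Hypothetical DOALL over an 89 x 144 horizontal grid (assumed for illustration).
total = 89 * 144
first_few = instances_for_processor(0, total)[:4]
```

Every instance is executed exactly once, and no processor receives more than one instance beyond any other.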

In COMP3, a major contribution to whether a given instance will
run for a long time or a short time is the condition of night vs.
day. Radiation computations are much simpler on the dark side of
the earth. At the equinox, computations would be for daytime
along 72 meridians, and for nighttime along 72 meridians. As the
DOALLs are arranged, with latitude subscript J first, all
processors do daylight instances together, and all processors do
nighttime instances together. This argument is somewhat
oversimplified, because of dawn and twilight effects, and must be
modified for other seasons, where all points around one pole are in
daylight and all points around the other are in darkness. However,
more detailed analysis still confirms that, for the GISS weather,
the straightforward assignment of instance number to processor
number results in nearly equal distribution of not only daytime and
nighttime within each individual processor but also latitude,
thereby helping to distribute the computational effort evenly
among all processors, and tending to make them finish nearly all
together at the end of COMP3.

TABLE A.6  Performance Analysis of GISS Weather Model

Input Parameters

   Grid Size          = 89 x 144 x 9
   Time Step          = 20 minutes
   Total Time         = 14 days
   Total Time Steps   = 1008
   NCYCLE             = 6
   NHOGAN             = 5   Radiation call frequency
   NCOMP3             = 3   Physics call frequency

Output Result Totals

   Flops/EU           = 2.88 x 10^8
   Max Time/EU        = 4.42 minutes*
   Flops/System       = 1.41 x 10^11
   Gflops/System      = 0.532

   * Does not include system startup time


A.6  SPECTRAL WEATHER

A.6.1  Summary

The spectral weather is expected to run with substantially higher
throughput than the GISS weather does. Its fluid dynamics
portions are done by spectral analysis, with each processor
processing an FFT independently of all other processors. (For a
discussion of the case that only one FFT is to be executed in the
FMP, see Section A.7.) Thus, the fluid dynamics computations are
much more locally contained, since all the intermediate results in
the FFT can be contained within processor memory. The chemistry
and physics portions of the spectral weather code are
substantially identical to those of the GISS weather code, and the
analysis of one can serve as the analysis of the other.

Therefore, the fluid dynamics portions of the spectral weather
code are expected to run somewhat better than the fluid dynamics
portions of the GISS; the chemistry and physics portions would
have the same throughput exactly. (This ignores the effects on
throughput of individual programming style, assuming that the
spectral weather style is no better nor worse than the style seen
in GISS.)

The FMP FORTRAN of Figure A.15 shows the essential portions of
subroutines GDSPC1 and FFTFOR. It is clearly efficient. The
inner loop in particular has only singly indexed local variables
and a substantial proportion of multiply-add operations. A
characteristic of all these loops is a short string of integer
operations after the index test and before any floating point
operations start. The slowdown due to integer indexing that
appeared in BTRI does not appear here except in the DO 317 loop in
GDSPC1 (see Fig. A.15), which is done N times, whereas the inner
loop is done N*LOG N times.

All local variables, including the local arrays, occupy
substantially less than the 4096 words of address space that is
accessible relative to the stack pointer, and which is reserved to
subroutine local use. Such examples support the decision to have
relatively short address fields within the instruction.

An  estimate  of  0.6  Gflops/sec  was  made  for  the  spectral  weather 
code.  This  estimate  is  not  yet  based  on  the  detailed  analysis 
performed  on  the  other  codes.  The  estimate  is  based  on  prior 
knowledge  of  the  chemistry  and  physics  portions  of  the  GISS 
weather  and  an  initial  evaluation  of  the  efficiency  of  execution 
of  the  FFT  portion  as  described  above. 

A.6.2  FMP  FORTRAN  Version  of  FFT  Portion  of  Program 

In the MIT spectral weather code, the FFT appears as subroutine
GDSPC1, which calls on subroutine FFTFOR. GDSPC1 takes NN arrays
of data, splits each array's odd and even parts symmetrically
about the center index, and rearranges the odd and even parts into
real and imaginary parts of an array of complex values. FFTFOR
then performs NN fast Fourier transforms on the NN complex arrays,
producing NN transforms.

NN is equal to the number of layers times the number of meridians.
In a model with 144 meridians, 89 latitudes, and 12 layers, NN =
12 x 144 = 1728 transforms would be performed at once.

FMP FORTRAN VERSION OF FOURIER TRANSFORM PORTION
OF MIT SPECTRAL WEATHER


      SUBROUTINE GDSPC1(DSPEC, DATARL, DATAIM, NLEV)
     1 NGPAR(7),LOGN


      COMMON /FTCST/ N,NLAT
      STRUCTURE DATARL(1), DATAIM(1), DSPEC(1)
      DOMAIN /SPACE/; IN=1,N; I=1,NLAT; L=1,NLEV


SIMPLIFY  CODE  BY  ASSUMING  N EVEN  FOR  THE  TIME  BEING 

’’innn"  ^ = ^ >/=/SPftCE < * , I , l / / 

      IODD = MOD(NLEV,2)

      NL2 = NLEV/2


COMBINE THE DATA FROM LEVEL 1 AND LEVEL NL2+1
AND FROM LEVEL 2 WITH LEVEL NL2+2, ETC, INTO
THE REAL AND IMAGINARY PARTS OF THE DATA INDEXED
ON LEVEL. THE FFT IS THEN DONE ON THE COMPLEX
DATA WHICH IS THEN UNRAVELED IN SUBROUTINE
SPCGD1


      DOALL /LATLV2(I,L)/ ; USING DATARL
      DO 316 IN = 1,N

      DATAIM(IN,I,L) = DATARL(IN,I,L+NL2)

      ENDDO /LATLV2/ ; GIVING DATAIM
      CALL FFTFOR(DATARL, DATAIM)

      DOALL /LATLV2(I,L)/ ; USING DATAIM, DATARL
      DO 317 IN = 1,N

      DATARL(IN,I,L+NL2) = DATAIM(IN,I,L)

      DATAIM(IN,I,L+NL2) = 0.D0

” 0«TARL<IN,  I,L>  OATAIMON,  I,L-rNL2^ 
- 0«T«PU  in!  I L.NLai 

      ENDDO /LATLV2/ ; GIVING DATARL, DATAIM
      CALL FORSPC(DSPEC,DATARL,DATAIM,NLEV)

      RETURN

      END


ALTERNATE ENTRY SPCGD1 IS VERY SIMILAR, OMITTED FOR NOW


Figure A.15  FMP FORTRAN Version of GDSPC1 and FFTFOR


      SUBROUTINE FFTFOR(DATARL, DATAIM)

      STRUCTURE DATARL(1), DATAIM(1)

      DOUBLE PRECISION WP,W,W2

      COMMON /FFT/ WP(7,7,15), W(2,7), NTRANS(16), LRI, NURTHP,
     1 NGPAR(7), LOGN

      COMMON /FTCST/ N,NLAT

FOR BREVITY, THIS EXAMPLE IS SIMPLIFIED TO THE FORWARD
FFT ONLY, LEAVING THE REVERSE TRANSFORM TO BE ADDED LATER

      DOMAIN /SPACE/; IN=1,N; I=1,NLAT; L=1,NLEV
      DOMAIN /LATLV2/; I=1,NLAT; L=1,NL2
      INALL /LATLEV/ DTRL(15), DTIM(15)

      ISIGN = 1

      DOALL /LATLEV(I,L)/; USING DATARL, DATAIM, /FFT/, /FTCST/

      DO 12
      DTRL(J) = DATARL(NTRANS(J),I,L)
      DTIM(J) = DATAIM(NTRANS(J),I,L)
      DTIM(NTRANS(J)) = DATAIM(J,I,L)
      DTRL(NTRANS(J)) = DATARL(J,I,L)

      DO 15 J=1,N/2
      TEMPR = DTRL(2*J-1) + DTRL(2*J)
      TEMPI = DTIM(2*J-1) + DTIM(2*J)
      DTRL(2*J) = DTRL(2*J-1) - DTRL(2*J)
      DTIM(2*J) = DTIM(2*J-1) - DTIM(2*J)
      DTIM(2*J-1) = TEMPI
      DTRL(2*J-1) = TEMPR

      DO 90 II = 2,LOGN
      NUM = 2**II
      NUMHF = NUM/2
      NSS = 1*2**(LOGN-II)

THE ABOVE EQUALS N/NUM, AS SHOWN IN THE ORIGINAL PROGRAM,
BUT POWERS OF 2 ARE NUMERIC SHIFTS, MUCH FASTER THAN A
DIVIDE

      DO 90 J=1,NSS
      NUMJK = NUM*(J-1)

      LL = 1+NUMJK
      MM = LL+NUMHF

NOTICE THE DELETION FROM THESE VARIABLES OF OFFSETS
CORRESPONDING TO THE DOMAIN VARIABLES WHICH APPEARED
IN THE ORIGINAL


Figure A.15  FMP FORTRAN Version of GDSPC1 and FFTFOR (Cont'd)

      TEMPR = DTRL(LL) + DTRL(MM)
      TEMPI = DTIM(LL) + DTIM(MM)
      DTRL(MM) = DTRL(LL) - DTRL(MM)
      DTIM(MM) = DTIM(LL) - DTIM(MM)
      DTRL(LL) = TEMPR
      DTIM(LL) = TEMPI

      DO 90 K=2,NUMHF
      LL = K+NUMJK
      MM = LL + NUMHF
      MMM = NSS*(K-1)
      W2 = -W(2,MMM)

NOTE THAT THE ABOVE WOULD BE CONDITIONAL SIGN IF REVERSE FFT

      CROSSR = DTRL(MM)*W(1,MMM) + DATAIM(MM)*W2
      CROSSI = DTIM(MM)*W(1,MMM) - DATARL(MM)*W2
      DTRL(MM) = DTRL(LL) - CROSSR
      DTIM(MM) = DTIM(LL) - CROSSI
      DTRL(LL) = DTRL(LL) + CROSSR
      DTIM(LL) = DTIM(LL) + CROSSI
   90 CONTINUE

      DO 100 II=1,N

NORMALIZE AND PUT BACK IN STRUCTURE VARIABLES
DIVIDE BY 2**LOGN IS A SUBTRACT FROM EXPONENT,
RUNS MUCH FASTER THAN DIVIDE BY N

      DATARL(II,I,L) = DTRL(II) / 2**LOGN
      DATAIM(II,I,L) = DTIM(II) / 2**LOGN
  100 CONTINUE
      RETURN

      ENDDO /LATLEV/; GIVING DATARL, DATAIM
      END


Figure A.15  FMP FORTRAN Version of GDSPC1 and FFTFOR (Cont'd)


The obvious, and simple, strategy is to have a DOALL on layers and
longitudes, with each instance performing a serial transform.
That is:

      DOALL I=1,144; L=1,NLEV

      ... here the code for a serial fast Fourier transform
      ENDDO
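The strategy of one independent serial transform per (I, L) instance can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the recursive radix-2 routine below is a generic textbook FFT, not the FFTFOR code of Figure A.15; it does not apply the division by N that FFTFOR performs; and the dictionary keyed by (I, L) is a stand-in for the DOALL domain. The names `fft` and `transform_all` are hypothetical.

```python
import cmath


def fft(x):
    """Generic recursive radix-2 FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transform_all(data):
    """Transform every (I, L) sequence independently.

    Each dictionary entry plays the role of one DOALL instance: no
    instance reads or writes the data of any other instance.
    """
    return {key: fft(seq) for key, seq in data.items()}
```

Because the instances share no intermediate results, they can be handed to processors in any order, which is what allows the large DOALL to surround the short serial loops.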

One of the optimizations in the original program needs to be
undone in order to separate the loop into a large DOALL and a
short DO loop. The original version took the multidimensional
arrays that naturally appear in the problem and unwound them into
one-dimensional arrays. Thus, a substantial amount of index
computation was saved by doing the index calculations separately.
In order to make best use of the FMP, the structure inherent in
the physical problem needs to be retained.

The FMP FORTRAN version shown in Figure A.15 includes the
conversion from space variables to complex function and the forward
Fourier transform (complex) on the complex function, but omits the
conversion from complex function back to real frequency functions,
and also omits the reverse Fourier transform, since both of these
are trivially different from the code that is exhibited.

The arrays DATARL and DATAIM in the original FORTRAN version are
used both to hold the entire input and output files of the
transform, and also to use as working space during the course of
the transformation. In this FMP FORTRAN version, two STRUCTURE
arrays DATARL and DATAIM are used to hold the entire input and
output files before and after the transformations, but two LOCAL
arrays DTRL and DTIM are used as the working space during the
course of the transformation.

Each processor is doing one FFT serially. There are as many FFTs
being executed as there are points around the equator times the
number of levels. The code as exhibited therefore would be
efficient only for grids somewhat finer than the 16 latitudes x 24
longitudes of the MIT code as submitted.

The following list gives, in sequence, the variables that are
candidates for being assigned to registers. This list covers
subroutine FFTFOR only.

INTEGER REGISTERS                        FLOATING-POINT REGISTERS

Stack pointer                            TEMPR
Base address DATARL                      TEMPI
Base address DATAIM                      W2
Cycle index for INALL                    CROSSR
J                                        CROSSI
Base address of common /FTCST/
N
Processor no., for DOALL control
I
L
N/2 (loop limit)
2*J (common subexpression)
2*J-1 (common subexpression)
Base address of common /FFT/
II
LOGN
NUM
NUMHF
NSS
NUMJK
LL
MM
K
MMM
2**LOGN (common subexpression)


In addition to the above list, some scratch and accumulator
registers need to be assigned (some double length). As was observed
in the analysis of the explicit code, more integer registers would
be needed to avoid saving and restoring them. The number of
floating point registers is adequate.

A.7  OTHER ANALYSIS

As an example of additional applications, this section will
discuss two application areas that fall outside of the benchmark
programs. The first section discusses how well the FMP would do
on FFTs when only one FFT is being done instead of 512 FFTs
operating efficiently in parallel as in the spectral weather.

A.7.1  Fast Fourier Transforms on the FMP

A.7.1.1  Discussion

This section makes some preliminary estimates of the throughput
of the FMP executing a single FFT across the entire array. If
data length is assumed to be a power of 2 and at least 512 long,
the resulting throughput is estimated to lie between 0.6 and 1.0
Gflops/sec. The exact throughput figure is dependent on the
algorithm selected for the FFT. This section is a discussion of
the algorithms, and a description of how they operate. An FMP
FORTRAN version of one of the algorithms is presented.


Algorithms which have the final result stored on "scrambled"
indices were developed to allow in-place computation to save
memory. The data interactions in these algorithms correspond to
swapping data between the upper half and the lower half of some
subset of the data, the size of the subset being a power of 2. At
the end, the scrambled data is stored in memory, the indices are
bit-for-bit reversed (so that 0000011 becomes 1100000), and the
reversed indices are then used to reorder the resulting data.
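The bit-for-bit index reversal used to unscramble the result can be sketched as follows. This is a minimal Python illustration, not code from the report; the function names are hypothetical, and the data length is assumed to be a power of 2.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index, e.g. 0000011 -> 1100000."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def unscramble(data):
    """Reorder data stored on bit-reversed indices into natural order."""
    n = len(data)
    bits = n.bit_length() - 1  # log2(n) for a power-of-2 length
    return [data[bit_reverse(i, bits)] for i in range(n)]
```

For an 8-point transform, `unscramble` reads positions 0, 4, 2, 6, 1, 5, 3, 7 in that order. Bit reversal is its own inverse, so the same routine scrambles and unscrambles.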

Other algorithms, such as Glassman's [4], require that the data
interchange in the body of the algorithm be a perfect shuffle.
There is no rearrangement required at the end.
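For reference, the perfect shuffle permutation itself can be written in a few lines. The sketch below is an assumed, generic formulation (interleaving the lower and upper halves of the data); it illustrates only the permutation, not Glassman's algorithm or the FMP implementation, and the function name is hypothetical.

```python
def perfect_shuffle(data):
    """Interleave the lower and upper halves of data (even length assumed)."""
    half = len(data) // 2
    shuffled = []
    for low_item, high_item in zip(data[:half], data[half:]):
        shuffled.append(low_item)
        shuffled.append(high_item)
    return shuffled
```

On a power-of-2 length the shuffle rotates the bits of each index left by one position, so applying it log2(N) times restores the original order.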

For a 512-point FFT the computations would be fully parallel
across the processors, and the swaps, shuffles, or rearrangements
would take place on all data. For the 512-point case:

   For the "scrambled" algorithms, there are 9 swaps and 1
   rearrangement.

   For the Glassman algorithm, there are 9 perfect shuffles.


For FFTs with more points than 512, the amount of data being
swapped doubles, and the number of swaps goes as log2(N), while
the number of multiplications and additions is proportional to N
log2(N). There are exactly (1/2)N log2(N) complex multiplications
in the Glassman algorithm, for taking the Fourier transform of a
real variable (since the odd and even parts of the real function
can be combined into the real and imaginary parts of a complex
function defined over half as many points).

Thus, the time required for each of the following needs to be
considered.

   o  SWAPPING N/512 items of data
   o  PERFECTSHUFFLING N/512 items of data
   o  REARRANGING N/512 items of data

The times for the above would then be inserted into a formula
where SWAPping and SHUFFLing are multiplied by log2N. As a first
approximation, these times would be added to the time taken for
computation to get the net time for an FFT. The result is that
the "scrambled" versions of the FFT run substantially better on
the FMP, since the SWAP is the SHIFTN operator, while SHUFFLing
and REARRANGING are stores to EM followed by fetches from EM.

A.7.1.2  Timing Estimates

Within the inner loop of an FFT, all processors do the same
computations, and hence will stay in synchronism. Any
synchronizations required do not imply any significant time wasted
waiting for the slowest processor.

A SWAP  consists  of: 

N/512  SHIFTN  instructions,  at  12  clocks  each. 

A PERFECTSHUFFLE  consists  of: 

N/512 STOREMs. Each STOREM occurs in its proper place
within its own instance, not as a string of successive
STOREMs. Hence processing can be concurrent with the
write to EM.

NEXTDO (the splitting of a DOALL into two successive
DOALLs) requires the termination of the instances, a
synchronization, and the hidden cycle loop of the
subsequent DOALL. Hence, the following code is executed in
the processor:

IJUMP  % end  of  instances 

WAIT  % processor  side  of  the  synchronization 

IMOVEL  % cycle  loop  variable  initialization 

IMOVEL  % cycle  loop  limit 

ITIX  % cycle  loop 


which  has  a total  of  13  clocks  (before  correcting  for  overlap 
and  instruction  fetching). 

N/512 LOADEMs. If the STOREMs are to EM modules with a skip
distance of 2, the perfect shuffle has the LOADEMs at a skip
distance of 1, which is one of the "magic" skip distances at
which the CN has no conflicts.

The final formula for timing, in terms of number of clocks,
using Tem to indicate the number of clocks per EM access
(including address computation) is:

   Tps = 2(N/512)Tem + 13

A REARRANGING of the data on scrambled indices consists of
N/512 STOREMs, all occurring in succession, followed by the 13
clocks of the NEXTDO, followed by N/512 LOADEMs. The STOREMs
are in succession, so EM module busy will keep them at least 9
clocks apart, but the "EM busy" of the last STOREM can be
hidden behind the 13 clocks implied by the NEXTDO.

In addition, the subscripting on scrambled indices, bit
reversed, is the worst possible permutation for CN conflicts.
It will take 16 times CN access time plus EM cycle (144
clocks) to get all 512 requests through. These additional 144
clocks are approximately the same whether the bit reversed
scrambled indexing occurs on the STOREM or the LOADEM. A
formula is thus

   Tr = 2(N/512)(Tem + 144) + 13


The time added to the entire FFT by these operations can now be
computed. Remembering that there are log2N passes through the
inner loop, time for the "scrambled indices" algorithm (after
reducing the formula) is:

   T = log2N(N/512)*12 + 2(N/512)(Tem + 144)

For the perfect-shuffle (Glassman) type, time is:

   T = log2N(N/512)*Tem + 13 log2N

These are the times spent in data rearrangement. In addition,
there are (N log2N)/1024 complex multiplications and additions per
processor, or 4(N/512)log2N floating point operations per
processor. Simulation shows that real programs that are fairly
well adapted to the FMP run at about 1.3 Gflops (AMATRX, BTRI).
This is about 9.8 clocks per floating point operation. If the rest
of the FFT does as well, the time spent in computation is
39.2 log2N(N/512) clocks.

Table A.8 shows these numbers, together with an estimated
throughput rate (in Gflops) for the FFT assuming that there is no
overlap and that otherwise all the above assumptions hold. An
estimate of 35 clocks for Tem was used to cover address
computations, CN delay, and EM access time.
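The timing formulas of this section can be evaluated directly. The Python sketch below transcribes them as printed (12 clocks per SHIFTN, 144 extra clocks for bit-reversed access, 13 clocks per NEXTDO, 39.2 clocks of computation per pass per N/512 block, Tem = 35). The throughput estimate is an assumption of this sketch: it scales the quoted 1.3 Gflops by the fraction of clocks spent computing, with no overlap. It is not claimed to reproduce the entries of Table A.8, and the function names are hypothetical.

```python
import math

PROCESSORS = 512        # stated in the text
CLOCKS_PER_FLOP = 9.8   # stated in the text for well-adapted programs
BASE_GFLOPS = 1.3       # stated rate corresponding to 9.8 clocks per flop
T_EM = 35               # stated estimate of clocks per EM access


def compute_clocks(n):
    """Clocks spent in arithmetic: 39.2 log2(N) (N/512)."""
    return 4 * CLOCKS_PER_FLOP * math.log2(n) * (n / PROCESSORS)


def scrambled_rearrange_clocks(n, t_em=T_EM):
    """Swaps on every pass plus one final bit-reversed rearrangement."""
    blocks = n / PROCESSORS
    return math.log2(n) * blocks * 12 + 2 * blocks * (t_em + 144)


def shuffle_rearrange_clocks(n, t_em=T_EM):
    """One perfect shuffle per pass, as printed for the Glassman type."""
    passes = math.log2(n)
    return passes * (n / PROCESSORS) * t_em + 13 * passes


def estimated_gflops(n, rearrange_clocks):
    """Assumed estimate: 1.3 Gflops scaled by the computing fraction."""
    busy = compute_clocks(n)
    return BASE_GFLOPS * busy / (busy + rearrange_clocks)
```

For example, `estimated_gflops(512, scrambled_rearrange_clocks(512))` evaluates the no-overlap estimate for a single 512-point transform.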

A.7.1.3  FMP FORTRAN Version of Glassman's FFT Algorithm

Figure A.16 is an example of an FFT coded for the FMP. The
attached FFT is Glassman's algorithm, and does not scramble the
indices.

The FMP FORTRAN in Figure A.16 is a rather direct translation of
an existing ALGOL program (Figure A.17). In translating from the
ALGOL to the FMP FORTRAN, it is likely that the result is not
optimized for the FMP. Specifically, the perfect shuffle in this
particular code consists of fetching the Z items on shuffled
indices into an INALL array, then the NEXTDO for finishing all
instances for data precedence, and then four successive STOREMs.
Successive STOREMs are not overlapped as they would be if mixed in
with computation.

This program runs for any binary value of N whatsoever, but is
efficient only for N equal to 512 or greater. The language is
completely independent of the number of processors.

The ALGOL program of Figure A.17 is a free-standing program, which
reads in a data deck and prints out the transform. It was written
for demonstration purposes, to show that the Glassman algorithm
had indeed been understood and programmed. For the FMP, it is
assumed that some main program supplies the data and uses the
results. Thus, all of the I/O and some of the initialization has
been sloughed off onto this assumed main program and does not
appear in subroutine GLASMN.

Figure A.16  FMP FORTRAN Version of Glassman's FFT Algorithm

Figure A.17  ALGOL Version of Glassman's FFT Algorithm

A.7.2  A Parallel Sort

Sorting is a common computer application. This section
demonstrates an in-core sort that makes use of all processors at a
reasonable processor utilization. Seldom are the items to be
sorted simply numbers to be sorted by magnitude; however, this is
the easiest example to use to show how the algorithm works. The
algorithm starts from a state in which the items to be sorted are
distributed uniformly among the processors. "Processor" could
mean either processor local memory, or a piece of LM address space
allocated to a specific processor. The algorithm will work for
the number of processors (2^n) equal to any power of 2. The
example will be given for a number of processors equal to eight.
The starting condition for the example is given in Figure A.18.A.

The succeeding steps in the algorithm go as follows:

   1.  Sort the items local to each processor, yielding the
       state of Figure A.18.B.

   2.  Determine the median value globally. One method for
       doing this is to guess at a median, and then count how
       many items are greater or less than this guess. The
       total count is given by means of a SUMALL function on the
       individual processor counts. If this guess is not close
       enough to the median, one makes a new guess, and finds a
       new count. This procedure iterates until a value close
       enough to the median is found. Each processor divides
       its pile of sorted items into two parts, one larger than
       the median, one smaller than the median. This division
       is marked in Figure A.18.B.

       Swap parts between processors that are 2^m (m = n-1)
       apart. The lower numbered processor of each swapping
       pair sends the higher of its two parts to the higher
       numbered processor, and the higher numbered processor
       sends the lower of its two parts to the lower numbered
       processor. After the swap the contents of the various
       processors are like Figure A.18.C.

   3.  Sort again within each processor.

   4.  Divide the range over which the median is to be found in
       half, and find separate medians for each half. The state
       is now Figure A.18.D.

   5.  Decrease m by one, that is, divide the swapping distance
       in half, and swap again. The result is Figure A.18.E.

   6.  Repeat steps 3, 4, and 5, finding medians over ranges
       which are divided in half each time, and swapping over
       distances which are divided in half each time, until the
       swapping distance is reduced to one. For the example
       with eight processors, step six goes only once, producing
       the result shown in Figure A.18.G.

   7.  Sort again in each processor.
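The steps above can be sketched in Python, with each processor modeled as a plain list. This is an illustrative simplification under stated assumptions: the exact median of each group is taken by pooling the group's items, rather than by the guess-and-count iteration with SUMALL described in step 2; items equal to the median are sent to the lower half; and the function name `parallel_sort` is hypothetical.

```python
def parallel_sort(piles):
    """Sort items spread over 2^n 'processors' (one list per processor).

    Returns new per-processor lists whose concatenation is fully sorted.
    """
    count = len(piles)
    if count == 0 or count & (count - 1) != 0:
        raise ValueError("number of processors must be a power of 2")
    piles = [sorted(p) for p in piles]            # step 1: local sort
    group = count                                 # range sharing one median
    while group > 1:
        distance = group // 2                     # swapping distance 2^m
        for start in range(0, count, group):
            pooled = sorted(v for q in range(start, start + group)
                            for v in piles[q])
            if not pooled:
                continue
            median = pooled[(len(pooled) - 1) // 2]   # exact median (assumed)
            for low in range(start, start + distance):
                high = low + distance
                both = piles[low] + piles[high]
                # Lower-numbered partner keeps the low parts, higher-numbered
                # partner keeps the high parts; then each sorts locally.
                piles[low] = sorted(v for v in both if v <= median)
                piles[high] = sorted(v for v in both if v > median)
        group = distance                          # halve range and distance
    return piles
```

After the last pass every item in a lower-numbered processor is no larger than any item in a higher-numbered one, so reading the processors in order yields the sorted sequence. The intermediate piles are generally of unequal size, which is the source of the utilization loss in the intermediate steps.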

Processor utilization depends on the uniformity with which the
data is distributed among the processors in the intermediate
steps, since the data is equally divided among all 2^n processors
both at the beginning and the end of the algorithm. As an
example, consider the sorting between the states of Figures A.18.C
and A.18.D. Assume that the amount of time taken in a single
processor is proportional to N log N. Processor No. 4 has 6 items,
and takes a time proportional to 6 log 6. Processor 0 has 2 items
and takes a time proportional to 2 log 2. The total time spent
working is proportional to 56.66 while the longest processor time
times 8 is 86.02, giving a processor utilization during this step
of 65.9 percent.
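The utilization figure is the ratio of total work to the work that would be done if every processor were busy for as long as the slowest one. A small Python sketch of that ratio follows. It assumes the natural logarithm, which gives 8 x 6 log 6 of about 86.0, close to the quoted 86.02. The per-processor item counts of Figure A.18.C are not reproduced here, so any counts passed in are the caller's own, and the function name is hypothetical.

```python
import math


def utilization(item_counts):
    """Processor utilization for one local-sort step.

    Each processor's time is taken as n log n (natural log, an
    assumption); utilization is total work divided by the number of
    processors times the longest single-processor time.
    """
    work = [n * math.log(n) if n > 1 else 0.0 for n in item_counts]
    longest = max(work)
    if longest == 0.0:
        return 1.0  # nothing to sort anywhere; no processor waits
    return sum(work) / (len(work) * longest)
```

A uniform distribution gives a utilization of 1.0; the more the fullest processor exceeds the average, the lower the ratio.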

The actual FMP has 512 processors. Again, in the first and last
step, data is uniformly distributed among the processors. In the
intermediate steps, there will be some spread. If data were
randomly distributed among the processors it would take on
approximately the Poisson distribution, and the amount of data in
the fullest processor could be estimated from that. Given that
there are an average of N items per processor, in the Poisson case
the processor with the most elements would have about
N + 3N^(1/2) elements for N large enough. For N=10, a table of the
Poisson distribution shows that one processor in 512 is expected
to have 19.8 elements, whereas the approximation for N large gives
19.5.
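The large-N approximation is easy to check numerically. The Python sketch below evaluates N + 3N^(1/2) and also provides a Poisson tail probability for readers who wish to compare against a table. The way the quoted 19.8 was read from the table is not stated in the text, so no attempt is made to reproduce it, and the function names are hypothetical.

```python
import math


def fullest_processor_estimate(mean_items):
    """Large-N approximation quoted in the text: N + 3 * sqrt(N)."""
    return mean_items + 3.0 * math.sqrt(mean_items)


def poisson_tail(k, mean_items):
    """P(X >= k) for a Poisson variable with the given mean."""
    term = math.exp(-mean_items)  # P(X = 0)
    below = 0.0
    for j in range(k):
        below += term                 # accumulate P(X = j)
        term *= mean_items / (j + 1)  # advance to P(X = j + 1)
    return 1.0 - below
```

For a mean of 10 the approximation evaluates to about 19.5, the figure quoted above.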

Finally, an interesting observation: if the items to be sorted
happen to be in inverse order, it turns out that the distribution
among processors remains uniform through the entire procedure, and
processor utilization is 100 percent.

APPENDIX B

FMP CONNECTION NETWORK - ANALYSIS AND EVALUATION


B.1  SUMMARY

A connection network (CN) is to stand between the 512 processors
and the 521 EM modules and is to satisfy these requirements. The
connection network would accept requests from the processors,
possibly all 512 simultaneously, and establish connections
between the requesting processor and the requested EM module at EM
memory speed. A crossbar switch between processors and memory
modules can provide this function, but at a terrible cost in
hardware. It has N^2 crosspoints, where N is the number of ports
along one side.

This appendix describes a connection network based on the Omega
network (described below and in [4]). The Omega network has
O(N log2N) components, not O(N^2). The particular network which
appears to best satisfy the Connection Network requirements is a
duplex Omega network, providing redundancy for additional
reliability, as well as providing the required function.
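The difference between the two growth rates can be made concrete with a short Python sketch. It assumes an Omega network built from two-by-two switches with log2(N) levels of N/2 switches each, which agrees with the ten levels of 512 switches given in Section B.2 for 1024 ports. The function names are hypothetical, and the counts are component counts only, not hardware cost estimates.

```python
def crossbar_crosspoints(ports):
    """Crosspoints in a full crossbar with `ports` ports on each side."""
    return ports * ports


def omega_switches(ports):
    """Two-by-two switches in an Omega network with `ports` ports per side.

    Assumes a power-of-2 port count: log2(ports) levels, ports/2 switches
    per level.
    """
    if ports < 2 or ports & (ports - 1) != 0:
        raise ValueError("port count must be a power of 2")
    levels = ports.bit_length() - 1
    return levels * (ports // 2)
```

For 1024 ports this gives 1,048,576 crosspoints for the crossbar against 5,120 two-by-two switches for the Omega network.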

The appendix is arranged in the following sequence. First, some
background information and definitions are presented. Second, the
advantages and disadvantages of providing a CN that satisfies the
requirements of being functionally "almost" a crossbar switch are
presented, especially as compared to the original TN (1,2).
Third, various candidate versions of the connection network are
described in detail, including estimates of relative hardware
complexity of each. Fourth, simulation results on these
candidates are presented, obtained by a functional simulator of the
CN and by a second program called the stochastic analyzer. Fifth,
further discussion of the simulation results is used to narrow
down the selection of CN to one or two of the cases simulated and
analyzed. Following that, there is a discussion of other
CN-related topics, including some of the design details that were
disclosed by the simulation results, and finally, a paragraph of
conclusions. The conclusion reached is that sufficient study has
been completed to give confidence in the feasibility of the
Connection Network in the FMP architecture, but that
cost/performance tradeoffs deserve to be further considered.

Discussion  of  the  simulators  and  analyzers  has  been  relegated  to 
Appendix  H. 

B.2  BACKGROUND 

The connection network can be visualized as a circuit-switched
dial-up network in which up to 512 callers (the processors) are
placing short calls to the 521 callees (the EM modules).
Connections are to be made in tens of nanoseconds, and held for a
few hundred nanoseconds. Except for the time scale, the action is
like that of the telephone network, and hence, the design of such
a network starts with work done at Bell Labs (3). The work of
Duncan Lawrie (4) is especially applicable.

Many similar networks have been developed, but they have been
shown to be topologically equivalent to each other (5, 9).* One
name, the "Omega" network, has been chosen as the term to use for
any of this class of networks. The Omega network is shown in a
form called the "baseline" network in (9).

In the FMP architecture being evaluated, each processor computes
its own address in EM. There is no central location where the
switching pattern for the entire network is defined. All patterns
of connection are possible. Since connections must be made in
tens of nanoseconds, there is no time to take a global look at the
entire pattern and generate a set of control bits for the
network. Hence, control of the various portions of the network
must be local to those portions.

Several different networks have been investigated, and feasibility
of the FMP can be achieved with several of them. The
underlying Omega network design, on which the preferred versions
are based, has 1024 ports on the processor side, 1024 ports on the
EM module side, and ten levels of nodes in between. There are
1024 data paths connecting one level to the next. Each level
consists of 512 two-by-two switches, which are described in more
detail below. The connections between nodes follow a pattern
designed to permit as many processors as possible to
access EM modules simultaneously in parallel for the patterns of
accessing which occur in the aero flow and weather codes.
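
The size figures quoted above follow from the port count alone. A
minimal Python sketch of that arithmetic is given here; the function
name is illustrative and does not come from the study, and the
crossbar figure is simply N x N crosspoints for comparison.

```python
import math


def omega_size(ports):
    """Return (levels, switches_per_level, total_switches) for an Omega
    network of two-by-two switches. Assumes `ports` is a power of two."""
    levels = int(math.log2(ports))
    assert 2 ** levels == ports, "port count must be a power of two"
    switches_per_level = ports // 2  # each switch handles two data paths
    return levels, switches_per_level, levels * switches_per_level


levels, per_level, total = omega_size(1024)
crossbar_points = 1024 * 1024  # crosspoints in a full 1024 x 1024 crossbar
print(levels, per_level, total, crossbar_points)
```

For 1024 ports this gives ten levels of 512 switches each, 5120
switches in all, against roughly a million crosspoints for a crossbar
of the same width.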

The previously described transposition network (1, 2) was
centrally controlled, and required two ten-bit control settings,
one of which was the skip distance of a p-ordered vector (defined
below). The transposition network consisted of two barrel
switches, one 521 wide and one 520 wide, and some appropriate wiring.
Since a barrel switch that is wider than 512 but not more than
1024 wide can be built of five levels of one-by-four switches,
the TN also had ten levels of logic. All transfers through the TN
had to be synchronized to the control settings, and only those
processors whose requests fit the constant skip-distance,
constant-offset description could execute during a
particular control setting.


* Names include: "baseline network" (5), "binary n-cube",
"butterfly", "flip network", "Omega network", "reverse baseline",
"simplified data manipulator", "hypertorus", and "SW banyan
network".
B.2.1 Definitions

Certain definitions are necessary in order to understand the rest
of this appendix. They are:

B.2.1.1  P-Ordered  Vector 

A p-ordered vector is a set of EM addresses such that the address
being accessed by the ith processor is in EM module number
(d + p*i) modulo 521, where d is called the "offset" and p is the
"skip distance".

B.2.1.2 P-Q-Ordered Vector

A p-q-ordered vector is a set of EM addresses such that the EM
module number being accessed by the ith processor is
(s + p*i + q*⌊i/k⌋) modulo 521, where k is the "length of each
piece", s and p are "offset" and "skip distance" as above, and q is
the "distance between pieces". The bottom brackets represent the
"largest integer not greater than".

For a system where there are 16 processors and 17 EM modules, an
example of a p-ordered vector would be for the 16 processors to
request access to the following memory modules respectively:

10, 13, 16, 2, 5, 8, 11, 14, 0, 3, 6, 9, 12, 15, 1, 4

where the offset is 10, and the skip distance is 3. For this same
system, with a p-q-ordered vector with p equal to 1 and five elements
per piece, the processors might be requesting from the following
EM modules respectively:

11, 12, 13, 14, 15, 3, 4, 5, 6, 7, 12, 13, 14, 15, 16, 4

In this case, p is 1, q is 4, and the length of the piece is 5.
Numbers are interpreted modulo 17.
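
The two definitions can be evaluated directly. The following Python
sketch applies the formulas of B.2.1.1 and B.2.1.2 to the
16-processor, 17-module example; the function and parameter names
are chosen here for illustration only.

```python
def p_ordered(offset, skip, n_proc, n_mod):
    """EM module number requested by each processor i: (d + p*i) mod n_mod."""
    return [(offset + skip * i) % n_mod for i in range(n_proc)]


def pq_ordered(offset, skip, q, piece_len, n_proc, n_mod):
    """EM module number for processor i: (s + p*i + q*floor(i/k)) mod n_mod."""
    return [(offset + skip * i + q * (i // piece_len)) % n_mod
            for i in range(n_proc)]


# Offset 10, skip distance 3
print(p_ordered(10, 3, 16, 17))
# Offset 11, p = 1, q = 4, pieces of length 5
print(pq_ordered(11, 1, 4, 5, 16, 17))
```

Under the formula as stated, the sixteenth p-q-ordered element
(i = 15) evaluates to (11 + 15 + 4*3) modulo 17 = 4.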

B.2.1.3  Random  Request 

A random request is a set of EM addresses such that the EM module
being accessed by the ith processor is a random variable, from 0
through 520, which is independent of the module number being
accessed by any other processor.

B.2.1.4  Blockage 

Blockage  is  the  result  when  two  requests  try  to  share  the  same 
path  in  the  connection  network,  which  can  then  only  supply  a path 
for  one  of  them. 


B.2.1.5 Conflict

Conflict is more than one processor accessing the same EM module
simultaneously.

B.2.1.6 Pileup

Pileup is the number of processors having a conflict at a given EM
module.
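
Conflict and pileup can be counted directly from a list of requested
module numbers. The sketch below is illustrative only: the helper
names, the offset of 7, the skip distance of 3, and the seeded random
generator are assumptions made here, not values from the study.

```python
import random
from collections import Counter


def pileups(requests):
    """Map each requested EM module to the number of processors requesting it."""
    return Counter(requests)


def max_pileup(requests):
    """Largest pileup over all EM modules for one set of requests."""
    return max(pileups(requests).values())


N_PROC, N_MOD = 512, 521

# A p-ordered vector (hypothetical offset 7, skip distance 3)
p_vec = [(7 + 3 * i) % N_MOD for i in range(N_PROC)]

# A random request; the seed is fixed only to make the run repeatable
rng = random.Random(0)
rand_vec = [rng.randrange(N_MOD) for _ in range(N_PROC)]

print(max_pileup(p_vec), max_pileup(rand_vec))
```

Because 521 is prime, a p-ordered vector whose skip distance is not a
multiple of 521 sends the 512 processors to 512 different modules, so
its largest pileup is 1. A random request will generally show larger
pileups.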

B.2.1.7 Frame

A frame is a parcel of data of fixed size sent over a transmission
path. In the CN, each frame is 11 bits; five successive frames
make a data word.

B.3 ADVANTAGES

The advantages of the CN over the previously studied Transposition
Network (TN) (1, 2) are simplification of user programming,
simplification of the compiler, improved performance, and a broader
spectrum of applications.

Compiler simplification arises because each processor computes EM
addresses independently of the other processors. The compiler
need not be aware of the relationships between those addresses.
No code is emitted to compute offsets or skip distances, or to
control how many LOADEM instructions are issued. No restrictions
need be imposed on subscript expressions. All of these represent
simplifications of the situation for an FMP using the previously
studied TN, where the compiler would have had to create an
alternate branch with dummy LOADEM instructions to keep
synchronization, even when a given processor skips all actual
computation in a section of code containing EM accesses. The
connection network (CN) does not require any synchronization, and thus
eliminates all dummy LOADEM instructions.

When the various instances of the DOALL fetch a set of array
elements that do not form a set of linearly spaced elements, no
user precautions and no analysis by the compiler are required.
Examples of nonlinearly spaced elements are the wraparound on
longitude in the GISS weather code, and the offsetting of the index J
in subroutine CHRVAL of the 2D MacCormack aero flow code. In the
baseline system these would not have been allowed. The user would
have had to vectorize them. The independent programming of the
processors can make the FMP more than merely a vector machine, so
this restriction, imposed by the TN, represented an
incompatibility with the system objectives.

When the Connection Network is used, system performance would be
affected much less by problem size than when the TN is used. For
example, consider the fetching of data subscripted with the domain
variables in two-dimensional DOALLs. Say the subscripts are I, J,
and K. Within the DOALL over I and J, fetches of an array
A(I,J,K) form p-ordered vectors with p equal to 1. Within the DOALL
over J and K, fetches of A(I,J,K) form p-ordered vectors with p
equal to IMAX. Within the DOALL over I and K, fetches of A(I,J,K)
form p-q-ordered vectors with p equal to 1 and q equal to
IMAX*(JMAX - 1). With the TN, all p-ordered vectors are fetched in
one LOADEM, but a fetch of a p-q-ordered vector requires that each
piece of the p-q-ordered vector be fetched with a separate LOADEM
operation, or 512/IMAX fetch operations just to load one datum per
processor. With the new CN, the number of EM cycles required for
a p-q-ordered vector is controlled by the largest pileup, usually
a much smaller number than the number of LOADEM instructions
needed with the TN. The pileup for p-q-ordered vectors is discussed
further in Section B.7.4, together with some simulation
results.

For the specific example that comes from the 3D explicit code
given us by NASA, in the smaller than normal mesh size of 31 x 31
x 31, the improvement is dramatic, from 17 LOADEM instructions
required to fetch A(I,J,K) over I and K, to a maximum pileup of
depth 2. Simulation of this case showed that all processors
received their data within two EM cycles.
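
The two figures for the 31 x 31 x 31 case can be reproduced by
counting. The Python sketch below is not the study's simulator: it
only applies the p-q-ordered definition of B.2.1.2 with p = 1,
k = IMAX and q = IMAX*(JMAX - 1), counts module conflicts, and
ignores blockage inside the network. The zero offset and all names
are assumptions made for illustration.

```python
import math
from collections import Counter

N_PROC, N_MOD = 512, 521
IMAX = JMAX = 31


def modules_over_i_and_k(offset=0):
    """EM module requested by each processor for A(I,J,K) in a DOALL over I and K."""
    p, q, k = 1, IMAX * (JMAX - 1), IMAX
    return [(offset + p * i + q * (i // k)) % N_MOD for i in range(N_PROC)]


# TN: one LOADEM per piece of length IMAX
tn_loadems = math.ceil(N_PROC / IMAX)

# CN: EM cycles are governed by the largest pileup
cn_max_pileup = max(Counter(modules_over_i_and_k()).values())

print(tn_loadems, cn_max_pileup)
```

The count gives 17 pieces for the TN and a largest pileup of 2 for
the CN, in agreement with the figures quoted in the text.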

The following development shows this advantage of the CN over the
TN in analytic form. The slowest processor is the one holding up the
synchronization at the end of a DOALL. If access times were
normally distributed with mean $T_{av}$ and standard deviation $S$,
then the worst total of N access times, out of 512 such totals (i.e.,
the most-delayed processor), would have a value given in
Equation B.1.

$$\text{Max Delay} = N\,T_{av} + 3\,N^{1/2}\,S \qquad \text{(B.1)}$$

Because of the central limit theorem, this formula is valid for
large enough N without any need for assuming an underlying normal
distribution. Equation B.2 gives the corresponding formula for
the old TN.

$$\text{Max Delay} = N\,T_{max} \qquad \text{(B.2)}$$

$T_{max}$ is the time for however many LOADEM instructions are
required per fetch, and may be many times $T_{av}$. The reason for the
improvement of Equation B.1 over B.2 is that synchronization among
processors for EM accessing is not required with the CN, so that
each processor continues executing without any wait for the
slowest processor.

A possible wait would exist only at the end of a DOALL, where data
precedence may force a synchronization. N in Equations B.1 and
B.2 is the number of accesses between such waits.
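
The comparison can be made concrete with a small calculation. The
Python sketch below evaluates the two expressions as read here: for
the CN, N times the mean access time plus three standard deviations
of an N-term sum; for the TN, N times the worst-case fetch time. The
numerical values are hypothetical and serve only to show the shape
of the result.

```python
import math


def cn_max_delay(n, t_av, s):
    """Equation B.1 as read here: N*T_av + 3*sqrt(N)*S."""
    return n * t_av + 3.0 * math.sqrt(n) * s


def tn_max_delay(n, t_max):
    """Equation B.2 as read here: N*T_max."""
    return n * t_max


# Hypothetical values, in units of one EM cycle
n_accesses = 100
t_av, s, t_max = 1.0, 0.5, 4.0

print(cn_max_delay(n_accesses, t_av, s), tn_max_delay(n_accesses, t_max))
```

The spread term grows only as the square root of N, so for long runs
of accesses between synchronizations the CN delay approaches N times
the mean, while the TN delay stays at N times the worst case.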
A substantial gain in user convenience would be achieved with the
CN. All tricks such as adding dummy instances to the DOALL to
make the domain size equal to the array extents are unnecessary.
There are no "magic" array extents or DOALL domain sizes for
making the third direction have the same speed as the other two
with the CN. Likewise, all need to distort the algorithm to
regularize the subscript expressions would disappear. The code
shown in Figure B.1 is an FMP FORTRAN version of some statements
abstracted from subroutine CHRVAL in a 2D explicit code given to
Burroughs during the previous study. The subscript "J + OFFSET",
being the result of data dependent computations, would have been
disallowed in the originally proposed FMP FORTRAN (1, 2) because
of the restrictions imposed by the TN. Such a subscript is
perfectly proper in the currently described FMP FORTRAN. The many
awkward and arbitrary restrictions on the language, imposed by the
access pattern limitations of the old TN, are not required in a
system using the proposed Connection Network. Any integer
expression can be used as a subscript.

B.4  CN  DESCRIPTION 

The Connection Network (CN) has two modes of operation. First,
when the processors are independently operating, it would provide
a path from any given processor to the EM module of that
processor's choice, without regard to any other (up to 511) connections
from other processors to other EM modules. Second, certain
functions would be performed in synchronism, because these functions
are much more economical to implement when the processors are
synchronized. This second class of functions would be done under
coordinator command, at a time when all processors are in
synchronism. This second class includes

* Broadcast from coordinator to processors

* "Harvest" data from processors to coordinator

* Broadcast from one EM module to all processors (FETCHEM)

* Swap data between pairs of processors

Various CN design options are based on either a Benes or Omega
network. The Benes can make any permutation of connections
between processors on one side and EM modules on the other, but only
at the cost of having each connection a function of the
connectivity of all others. Opferman and Tsao-Wu (6) show that the
computation of this "perfect" connection takes on the order of N^2
computational steps (or N log2(N) if a content addressable memory
is used). Thus, for making connections in nanoseconds, it is not
possible to compute the control settings of a Benes network for
each set of new EM addresses. Instead, each node of the network
determines its own setting, based on some very simple computation,
with sufficient redundancy that a path for sufficiently many of
the processors is set up in the desired access time, with random
distribution of the excess time among all processors.
      DOALL, K=1,KM

    1 ... statements ...

      IF (condition) GO TO 10

      OFFSET = OFFSET + 1
      ... statements ...

      IF (DYX(J,K) .LT. 0) OFFSET = OFFSET - 1
      IF (J + OFFSET .GT. 1) GO TO 1
      ... statements ...

   10 ... statements ...

      DYX(J + OFFSET, K) = expression
      ENDDO


Figure B.1 FMP FORTRAN of Portions of Subroutine CHRVAL
of 2D Explicit Code


Out of several candidate designs, simulations have been used to
indicate the most efficient, i.e., fastest access time, lowest
occurrence of blocking, and smallest parts count.

Figure B.2 shows the first variation considered. In this case, a
1024-wide Benes network (only some of the edge nodes are shown)
has the first 512 ports on the left attached to the 512
processors, and the first 521 ports on the right attached to the
521 EM modules. Detailed examination shows many of the nodes can
never be used. The middle level must have a full 512 nodes to
switch its 1024 data paths (at two paths per node). In the
remaining nine levels not all the data paths can reach any EM
module, so that only some of the data paths need be implemented.
Parts counts can be derived from Table B.1, which gives the number
of paths required at each level of nodes:

TABLE B.1

Width vs. Layer Number

Layer No.     Number of paths (= 2 x nodes)

   10               1024
   11               1024
   12                768
   13                640
   14                576
   15                544
   16                528
   17                528
   18                524
   19                522
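
The entries of Table B.1 appear to follow a simple rounding rule.
This is an observation made here, not a statement taken from the
study: at layer L the width equals 521 rounded up to the next
multiple of 2^(20 - L). A short Python sketch under that assumption
(names are illustrative):

```python
def paths_at_layer(layer, n_modules=521, last_layer=19):
    """Width at `layer`, assuming the EM-side ports are rounded up to a
    whole number of blocks of 2**(last_layer + 1 - layer) paths."""
    block = 2 ** (last_layer + 1 - layer)
    blocks_needed = -(-n_modules // block)  # ceiling division
    return blocks_needed * block


table = {layer: paths_at_layer(layer) for layer in range(10, 20)}
for layer, paths in table.items():
    print(layer, paths, paths // 2)  # layer, paths, nodes
```

Under this rule the ten layers from 10 through 19 need 1024, 1024,
768, 640, 576, 544, 528, 528, 524 and 522 paths, the same values as
the table.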


On the side with processor ports, the 512 processors can all fit
into a 256-node, 512-wide path, with the result that of the 1024
paths to the middle, half are unused. Figure B.3 shows a smaller
example with 8 processors and 11 EM modules.


Each node is a 2 x 2 crossbar switch and is described in detail in
Section B.4.2. When paths from individual processors to
individual EM modules are set up (the "normal" mode of operation),
each node connects in either one of two ways:

1. Processor-side port A to EM-side port X, also
B to Y (straight-through).

2. Processor-side port A to EM-side port Y, also
B to X (crossed).


As long as the mode of operation is "normal", only one bit of
information is required to determine the setting of a node. When
only one processor-side port has a pending request, that port
provides the bit of control information. When both ports have
pending requests, one port must be chosen to provide the bit of
control information. That port is said to have "priority" over
the other one.

Each node determines its setting from this bit as follows. If the
bit is ONE, the port with the request is connected to the lower
EM-side port; if the bit is ZERO, the input port with the request
is connected through to the upper EM-side port. The control bit
is one of the bits of the port number on the EM side. The middle
level of nodes uses the most significant bit of the output port
number, the two levels on either side of the middle use the next
to most significant bit, and so on to the first and last node
levels, which use the least significant bit.
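
The node rule described above can be sketched in a few lines of
Python. This is an illustration, not the study's logic design. It
assumes that X is the upper EM-side port and Y the lower one, and it
takes the port with priority as an input rather than computing it;
the function name and return convention are hypothetical.

```python
def node_setting(bit_a, bit_b, priority="A"):
    """Decide one node's setting in "normal" mode.

    bit_a, bit_b: control bit (0 or 1) of the request pending on port A
    or B, or None if that port has no request.
    Returns (setting, blocked): setting is "straight" (A-X, B-Y),
    "crossed" (A-Y, B-X), or None if idle; blocked names the port whose
    request cannot be served, or None.
    """
    if bit_a is None and bit_b is None:
        return None, None
    # The port supplying the control bit: the only requester, or the
    # one holding priority when both ports have requests.
    if bit_b is None or (bit_a is not None and priority == "A"):
        chooser, bit = "A", bit_a
    else:
        chooser, bit = "B", bit_b
    # ZERO selects the upper EM-side port (assumed X), ONE the lower (Y).
    if chooser == "A":
        setting = "straight" if bit == 0 else "crossed"
    else:
        setting = "straight" if bit == 1 else "crossed"
    # Two requests wanting the same EM-side port: the loser is blocked.
    blocked = None
    if bit_a is not None and bit_b is not None and bit_a == bit_b:
        blocked = "B" if chooser == "A" else "A"
    return setting, blocked


print(node_setting(0, 1))       # both requests can be served
print(node_setting(1, 1, "B"))  # same port wanted, A is blocked
```

When the two control bits differ, both requests are served whichever
port has priority; blockage arises only when both bits are equal.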

B.4.1 Versions of Networks Considered

Several variations on this idea have been devised and simulated.
Figure B.4 is a revised version of Figure B.2, showing the entire
network, but eliminating the detailed depiction of each individual
node. It is, as has been previously noted, isomorphic to a
base-2 Benes network with some of the nodes omitted. Processor
ports, and EM ports, are each packed into the first 512 network
ports on both sides.

Figure B.5 shows the processors spread across every other port at
the left side of a 1024-wide network. The additional nodes are
intended to provide some redundancy in the connectivity.

Figure B.6 shows both processors and EM modules spread across a
1024-wide Benes. To simulate the spreading of EM modules,
transform module number M into a new module number M* as in Equation
B.3.

$$M^{*} = \begin{cases} 2M, & 0 \le M \le 511 \\ 2(M-512)+1, & 512 \le M \le 520 \end{cases} \qquad \text{(B.3)}$$

These expressions result when M is shifted left end-around one bit
position.
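
The end-around shift can be checked against the two expressions
directly. A minimal Python sketch, assuming a ten-bit port number
(the function name is illustrative):

```python
def spread_module(m, width_bits=10):
    """Rotate the module number left by one bit within `width_bits` bits."""
    mask = (1 << width_bits) - 1
    return ((m << 1) & mask) | (m >> (width_bits - 1))


# Modules 0..511 land on even ports, modules 512..520 on odd ports 1..17
print([spread_module(m) for m in (0, 1, 511, 512, 520)])
```

All 521 module numbers map to distinct ports, so the transformation
spreads the EM modules without collisions.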

Figure B.7 shows the same number of nodes as in Figure B.6
arranged as two second-halves of the Benes network. Duncan Lawrie
calls such a second half an "Omega network" [4]. The idea is that
if an access is not granted through the upper Omega, the processor
could try a few nanoseconds later through the bottom Omega.
B.4.2 Logic Design

Figure B.8 shows the basic node of any version. Two bidirectional
12-bit-wide paths connect to each side. On the processor side
they are labelled A and B, and on the EM side X and Y. Internal
connections may be made from A to either X or Y, and from B to
either X or Y. The 12 bits from processor to EM are used for EM
module number + 1 bit plus "strobe" when transfers are going on.
The twelve bits returning to the processor are 11 bits of data
plus a "latchup" bit. The "latchup" bit is a command to the node
to keep this path connected. The "latchup" bit would be
transmitted from the EM module upon recognizing a valid request coming
from the CN, and serves to keep the path connected as long as
"latchup" is true.

Logic in the CN buffer of the processor uses latchup as the
"acknowledge" that signals that a request has been granted.
Latchup could be dropped by the EM module after the operation
being performed ceases to need the data path. Alternatively,
timing could occur in the CN buffer, and the dropping of strobe
could be the signal to the EM to drop latchup.

The resting state is shown in Figure B.9. "Requests", consisting
of EM module numbers, may or may not be coming out of the
processors, and the connectivity of the node is set up according
to the specified function of module number bit and port bit (A vs.
B). The "latchup" bit coming back from the EM modules is false.
Connectivity is switched as fast as the requests change, since the
initial path connection is pure combinational logic. The command
lines from the CU have a "null" command.

At some time one of the requests finds its way to the correct EM
module, which then emits a "latchup" pulse. Other processors must
not disrupt the chosen path before it is latched. Therefore,
there is a "CN clock" in all processors, with a period longer than
the round-trip time of the CN, so that new requests are emitted
only after old requests have had a chance to latch up. The
round-trip delay is about 40 feet of wire, plus 19 gates' worth of logic
delay going out and 19 gates' worth of logic delay coming back. If
gate delays are 1.5 ns (including some allowance for wiring on
the boards), and wire is 1.6 ns/ft (Teflon or polyester belts),
this delay is 120 ns and sets a lower limit to the CN clock
period. This clock is a second timing signal to each processor
(the first is the main clock), not a countdown of the clock within
the processor. This timing signal selects every Nth pulse of the
main clock.
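
The 120 ns figure can be checked from the quantities given in the
paragraph above. A short Python sketch of the arithmetic, with
constant names chosen here:

```python
WIRE_FEET = 40         # round-trip wire length, feet
NS_PER_FOOT = 1.6      # propagation delay of the belts
GATES_EACH_WAY = 19    # logic levels going out, and again coming back
NS_PER_GATE = 1.5      # gate delay including board wiring allowance

wire_ns = WIRE_FEET * NS_PER_FOOT
logic_ns = 2 * GATES_EACH_WAY * NS_PER_GATE
round_trip_ns = wire_ns + logic_ns

print(wire_ns, logic_ns, round_trip_ns)
```

The wire contributes 64 ns and the logic 57 ns, for a total of about
121 ns, which the text rounds to 120 ns.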

Figures B.10, B.11 and B.12 show various latched-up states. Figure
B.11 shows the "straight-through" connection; Figure B.12 shows
the "crossed" connection.
Figure B.11 Latchup State, Both Paths, Both Transferring to EM


Figure B.12 Latchup State, Both Paths, Crossed Connectivity


B.4.3 CN Function Controls

The Connection Network serves other interconnection functions in
the proposed system besides the processor-EM paths. Other
functions are controlled from the coordinator. A list of CN
states defined by the coordinator's control is given in the following
paragraphs.

B.4.3.1 "BDCST/HVST"

The "BDCST/HVST" command makes a connection from both A and B to
both X and Y. Data from the CR enters all nodes at Y and, by
fanning out to both A and B, will reach all processors. Data from
the processors enters at both A and B, and is either ORed or ANDed
(it does not matter which) to be combined at the Y-port that the
coordinator listens to. This command is used for FETCHEM as
described below, and for HVST.

B.4.3.2 "Null"

With the coordinator (CR) node command turned off, the node
carries out its wired-in function of passing on requests, and
latching up for the "latchup" signal from the EM module as
previously described.

B.4.3.3 "Wraparound"

Connect port A to port B. This implements the SHIPCN function in
this CN. When the Nth level has a wraparound command, every
processor is connected to the processor whose CN port number
differs in the Nth bit. N is counted from the left in both
Figure B.4 and Figure B.7. The "wraparound" command is used
for processor-to-processor data swapping. Depending on which
of the ten levels of the CN receives the "wraparound"
command, data will be swapped between two CN ports which differ
in only one bit of their number. Normally, all nodes of the same
level get the wraparound command, with the result that all CN
ports swap data with those ports that differ by just one bit in
the specified bit position. SUMALL, for example, can be
implemented by swapping data that is just one apart on port number
("wraparound" on the least significant bit) and adding, then by
swapping that sum two apart and adding, then four apart, and so on
up to 256 apart.
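
The SUMALL procedure described above can be sketched in Python as
follows. The sketch models only the data movement: each port swaps
with the port whose number differs in one bit, adds, and repeats with
the distance doubled. It assumes 512 processors occupying ports 0
through 511, and the function name is chosen here for illustration.

```python
def sumall(values):
    """Sum over a power-of-two number of ports by repeated swap-and-add."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "port count must be a power of two"
    data = list(values)
    distance = 1
    while distance < n:
        # "Wraparound" on one bit: port i receives from port i XOR distance
        received = [data[i ^ distance] for i in range(n)]
        data = [own + other for own, other in zip(data, received)]
        distance *= 2
    return data


result = sumall(range(512))
print(result[0], result[511])
```

After nine swap-and-add steps (distances 1, 2, 4, ... 256) every
port holds the sum of all 512 values.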

B.4.3.4 Diagnostic Commands

As described in Chapter 6, Section 6.1, the individual Omega
networks (layers) of the two-Omega network must be tested
separately for diagnostic purposes. Thus, we need a command to
disable one Omega network while testing the other one. See
Section 6.1 for additional details.
B.4.3.5 FETCHEM

The FETCHEM instruction is implemented in two steps. First, the
EM module number is sent from the coordinator as a normal request
after the processors are synchronized (in order to ensure that
processors are not making requests of their own). This request is
accompanied by a command code to the EM module that causes reading
without sending any latchup.

During the access time of the EM module, the coordinator turns the
entire CN on with the "BDCST/HVST" command. Data from the
selected EM module is therefore broadcast to all processors.
Inactive EM modules emit zeroes to be ORed with the data (or ONEs
to be ANDed with the data, depending on how the nodes are
implemented).

B.4.3.6 HVST

The HVST instruction is implemented by the coordinator setting the
CN to the "BDCST/HVST" state, at a time when the CN buffers are
"full" and the processors otherwise idle, and then issuing "go",
which is the command being expected by the CN buffer for dumping
the data into the CN. The data arrives at all EM-side ports,
including the port that delivers the data to the coordinator.
HVST is intended to be used primarily for the case that only one
processor is enabled; therefore, it does not matter whether that
data is ORed with zeros, or ANDed with ones, in the CN. The
result is that it shall be left as the logic designer's option whether
the words combined during "BDCST/HVST" are ORed or ANDed.

B.4.3.7 Coordinator Access to EM

CR fetches and stores from EM are no different from processor
LOADEMs and STOREMs. The CR has its own CN buffer.

B.4.3.8 CN to Coordinator Status

Each node emits "busy" = 1 whenever one of its two paths is
latched up. The condition that no node be "busy" is necessary
before the CN can switch to some other command. Now it may be
possible for the CR to tell, from the state of synch of the array,
when the CN is idle, so there may be no need for the "busy" bit.
Until the rest of the design is finalized, however, a "busy" bit
is assumed.

B.4.4 Implementation Details
B.4.4.1 Flip Flops

Two alternative design approaches for the node are:

1. No flipflops, just logic that is latched up by the
"latchup" signal, as described above.

2. Path-holding flipflops that are clocked by the same 120 ns
CN clock that is seen as needed for timing the processor requests.
These flipflops hold the path for just one CN clock period.

Approach 2 has more gates, but permits faster access to extended
memory. Figure B.13 shows the timing diagram for the two cases.
In case 1, where the node is combinational logic only, the EM
module number contained in the "request" must be held statically
by the processor until the "latchup" signal returns from the EM
module. Then, and only then, would the processor be free to emit
an address toward the EM module over the now latched-up path.

In approach 2, the processor emits the request followed by two
frames of address plus operation code. Each frame is 11 bits. The
node, seeing one or two requests on ports A and B, sets the
flipflops with the CN clock, so that the address can continue down
the path, if it is possible to reach the EM module for this node.
The EM module gets its address about 80 ns sooner than it does in
approach 1, cutting 80 ns off the access time. These flipflops
will not stay up without the "latchup" signal coming back before
the next clock; thus, if the EM module is not reached, a new path
is to be set up on the next CN clock.

B.4.4.2 Wiring

Each node is controlled by one and only one bit of the EM module
number in the request of the port with priority. Since all nodes
are to be physically identical, the control bit must show up at
the same physical location in each node. Thus, previous nodes
must have a wiring pattern for the bits they pass along, such that
the bits show up in the control bit position after passing through the
correct number of nodes. Figure B.14 shows such a wiring pattern
for a 32-wide CN, such as might be appropriate for 16 processors
and 17 memories. Figure B.15 shows the first few levels of the
512/521 network, showing the connections from the X or Y output of
one level to the A or B inputs of the next. Since the interlevel
cables are belts, where wires must lie parallel, and since all
nodes are to be identical, these crossovers occur on the
paddleboards, not in the belts or within the nodes. (Similar
offset-by-one wiring patterns are seen on some of the Illiac IV
paddleboards.)


B.4.4.3 Logic

Each node contains two 12-wide two-way selector gates, one for
each of the X and Y outputs, and two 12-wide three-way selector gates,
one for each of A and B. The third input would take care of
wraparound. Each node also contains some decision-making logic.
The inputs to the decision-making logic are:
Figure B.14 Wiring Crossover Map, 16 Processors To/From
17 EM Modules


Figure B.15 Wiring Crossover Maps, Full Size System
* 1 bit of the EM module number (for every node, this is the
  least significant bit of the frame seen at that node
  because of the wiring patterns of Figure B.12)

  1 bit to control priority

* 1 bit, the 11th bit of the "request" frame, could be used
  for calling for processor-to-processor wraparound

  3 bits, command from coordinator

  1 "latchup" bit

* 1 "strobe" bit, bit 12 of the processor-to-EM path

  (1 CN clock, if design 2, with flipflops, is used)


Asterisked items occur on both ports (either A and B, or X and Y),
leading to a total count of 12 signals that have to be combined in
the combinatorial logic. The logic has the following output
signals: select A or B or both or no-output for X, select A or B
or both or no-output for Y, select X or Y or B for A, select X or
Y or A for B, "busy". Only 16 logically different output signals
are needed. "No output", or all lines FALSE, is substituted for
an input request that cannot reach its destination.

Feierbach and Stevenson [5] recommend the following algorithm for
determining priority: If the request at A and the request at B
can both be satisfied, then the node is set to either straight-
through or crossed connection, whichever is requested; if the
requests conflict, the node is set to straight-through, which will
be correct for one of the requests. Thus, A has priority if its
request bit is zero; B has priority if its request bit is one.
This algorithm introduces no bias against either A or B, but means
that certain memory modules will be more easily accessible from A,
and other memory modules will be more easily accessible from B.
If memory addressing averages out in some sense, then this
algorithm is unbiased.
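
The rule just described can be sketched in a few lines of Python.
This is only an illustration, not the node logic of the report. It
assumes that a request bit of 0 asks for output X and a bit of 1
asks for output Y, and that "straight-through" connects A to X and
B to Y; these conventions are chosen to agree with the statement
that A has priority when its request bit is zero. The names
set_node, STRAIGHT and CROSSED are hypothetical.

```python
from typing import Optional, Tuple

STRAIGHT, CROSSED = "straight", "crossed"


def set_node(a_bit: Optional[int], b_bit: Optional[int]) -> Tuple[str, bool, bool]:
    """Return (setting, a_blocked, b_blocked) for one two-input node.

    a_bit / b_bit is the routing bit of the request at input A / B
    (assumed: 0 asks for output X, 1 asks for output Y), or None when
    that input carries no request.  Straight-through joins A to X and
    B to Y; crossed joins A to Y and B to X.
    """
    wants_straight = (a_bit == 0) or (b_bit == 1)
    wants_crossed = (a_bit == 1) or (b_bit == 0)
    if wants_crossed and not wants_straight:
        # Every request present is satisfied by the crossed setting.
        return CROSSED, False, False
    # Either all requests fit straight-through, or the two requests
    # conflict and the node falls back to straight-through, which is
    # correct for exactly one of them.
    return STRAIGHT, a_bit == 1, b_bit == 0


if __name__ == "__main__":
    for a in (None, 0, 1):
        for b in (None, 0, 1):
            print(a, b, set_node(a, b))
```

With both request bits zero the sketch blocks B, and with both bits
one it blocks A, which is the priority behavior stated in the text.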

The priority rule used successfully in validating the CN is as
follows for the double Omega. In the upper Omega network, the
upper port of each node has priority; in the lower Omega network,
the lower port of each node has priority. Early simulation
results showed priority could not be left to chance, and all
double Omega simulation results reported in the next section, B.5,
were obtained with the priority set according to this rule. For
the single Omega network, the priority was alternated each CN clock
period, and there were an odd number of clock periods per EM
cycle. This is slightly more complicated than Feierbach and
Stevenson's rule, but was judged to be less biased.


B.4.4.4 Parts Count


The node's parts count, or at least the gate count, is dominated
by the selection gates: three-input selection gates for processor-
directed signals going back out of ports A and B, and two-input
selection gates for EM-bound signals out of ports X and Y. Using
the Benes network with processor ports packed into the first 512
processor-side ports, and with EM ports packed in the first 521, as
an example, the parts count goes as follows. The first nine
levels have 512 nodes each. For the last ten levels, we can count
the number of nodes per level from the data in Table B.1. The
number of nodes is just half of the number of paths passing
through a given level. Adding together the 19 numbers represent-
ing the number of nodes at each level gives a total of 5643
nodes. At each node, there are two ports, and a data path that is
12-wide, with 3-input gates in one direction and 2-input gates in
the other. Computation gives

     5643 x 12 x 2 = 135,432 3-way 1-input selection gates, plus
                     135,432 2-way 1-input selection gates, for a
                     total of 270,864 selection gates


By comparison, the Transposition Network in the preliminary study
[1, 2] consists of two barrel switches, each bidirectional and 9
bits wide in both directions. If these barrels were to be imple-
mented with 2-way 1-input selection gates, they would have

     (520 + 521) x 9 wide x 10 levels x 2 directions = 187,380

2-way 1-input selection gates.

Parts counts of the other variations differ accordingly.

If the same level of integration can be achieved in both designs,
then the Benes network of Figure B.2 should take more chips than
the TN by about the ratio of the number of inputs in these
selection gates, or

     (135,336 x 2 + 135,336 x 3) / (187,380 x 2) = 180.6%
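
The arithmetic of this parts count can be rechecked with a short
Python sketch. It uses only figures quoted in the text; the function
and variable names are hypothetical. The ratio as printed uses
135,336 gates of each kind, whereas 5643 x 12 x 2 is 135,432, so the
sketch evaluates the ratio for both values.

```python
# Gate-count arithmetic for the Benes CN versus the Transposition Network.
NODES = 5643          # total nodes over the 19 levels (from the text)
WIDTH = 12            # data path width per port
PORTS_PER_NODE = 2

three_way_gates = NODES * WIDTH * PORTS_PER_NODE   # processor-bound paths
two_way_gates = NODES * WIDTH * PORTS_PER_NODE     # EM-bound paths
cn_selection_gates = three_way_gates + two_way_gates

# Two barrel switches: (520 + 521) positions, 9 wide, 10 levels, 2 directions.
tn_two_way_gates = (520 + 521) * 9 * 10 * 2


def input_ratio(gates_per_kind: int, tn_gates: int) -> float:
    """Ratio of selection-gate inputs: CN (2-way plus 3-way) over TN (2-way)."""
    return (gates_per_kind * 2 + gates_per_kind * 3) / (tn_gates * 2)


if __name__ == "__main__":
    print(three_way_gates, cn_selection_gates, tn_two_way_gates)
    print(f"{100 * input_ratio(135_336, tn_two_way_gates):.1f}%")
    print(f"{100 * input_ratio(three_way_gates, tn_two_way_gates):.1f}%")
```

Either value gives a ratio of roughly 1.8, so the conclusion drawn
from the ratio is not sensitive to the difference.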

To such a node count must be added some additional increase
because of the combinatorial logic in each node, some additional
increase because of the additional processor interface required,
and, if the two-layer Omega network of Figure B.7 is used, some
additional increase in the EM module to resolve conflicting
accesses arriving from the two redundant networks. On the other
hand, the network controls, simple enough in the baseline design,
are even simpler here since most control is local to the node.

Some of the increase in size is due to the increase in width, from
9 lines in the baseline design to 12 lines here. This increase is
necessary since the EM module number plus a strobe must be trans-
mitted in parallel. However, this also would give some speed
advantage, a data word being transmitted in 5 frames instead of 7
bytes.

If only one Omega network, one sheet of Figure B.7, is implemented,
then some means of eliminating the bias against certain processors
must be adopted. Alternating the priority on a regular cycle has
been simulated, and appears to be satisfactory. The suggestions
of Feierbach and Stevenson [5], when adapted to this network, appear
to eliminate bias more economically, but perhaps with side
effects; they have not been simulated.

Another variation on the two-layer Omega network would be to
provide, at each node, a path for data to go up or down to the cor-
responding node of the other sheet. To the node logic of Figure
B.8, a path would be added from port B of the other node, entering
into output gates X and Y, as well as a path from both X and Y of
the other node, entering the outputs at port B. Port B is selec-
ted on the basis that it is the low priority port; the high prior-
ity port will always find a path on its own sheet. With a two-
sheet Omega, one probably uses fixed priorities, favoring the A
port on one sheet, and the B port on the other sheet. However,
the hardware on both sheets can be identical because of symmetry.
This variation increases the number of inputs of selection gates
from ten to 14. Since data paths dominate the hardware, this is
approximately a 40% increase in gate count for the entire CN to
provide these additional paths.

B.5  SIMULATION  RESULTS 

B.5.1  Summary 

Various CN configurations were simulated with the functional simu-
lator, and the simulator results were studied for various indica-
tors of goodness of function. Test cases consisted of filling a
queue of requests in each processor. In some tests all processors
had requests in a given queue position, so that all 512 processors
made requests. The requests in the 512 processors could form p-
ordered vectors, or could be p-q-ordered, or could be random. The
easiest criterion for performance evaluation is the percentage of
the 512 requests that are granted on the first EM cycle. Sections
B.5.3 and B.7.4 go into more detail on the performance after that
first cycle or when only a portion of the processors are request-
ing access.

Table B.2 shows, for a number of possible CN designs, this perfor-
mance on the first cycle, and also lists the gate count by number
of nodes and as a percentage of the gate count of the TN [1, 2].

Table B.2  Summary of Simulator and Analyzer Results


The four variations of networks with the best combination of non-
blocking and parts count were

Case 1.  A two-layer Omega network (Figure B.7) with 361% as much
         hardware as the Transposition Network.

Case 2.  A Benes network with processors spread (Figure B.5)
         with 254% as much hardware as the Transposition Network.

Case 3.  One sheet of Omega network with 164% as much hardware
         as the TN.

Case 4.  A two-layer Omega network but with sheet-to-sheet paths
         at each node. This is estimated to take about 485% of
         the hardware of the baseline TN.

Table B.2 also includes results obtained with the stochastic
analyzer, which computes the probability of blockage within the CN
under the assumption that the input is a random request. Since
the functional simulator could not handle case 4, the stochastic
analyzer results are all that is available for this case. In the
cases where the functional simulator and the stochastic analyzer
were both used, it is seen that the stochastic analyzer agrees
with the simulator results for the case of random requests. For
case 4, there are two outputs from the CN to the same EM module.
The stochastic analyzer does not count the two conflicting
requests arriving at these two ports for the same EM module as a
blockage.

The data of Table B.2 is plotted in Figure B.16, which shows
the tradeoff between speed and amount of hardware for the CN.
Speed is represented indirectly, as percent success for the case
of all processors requesting simultaneously; hardware is also
represented indirectly, as a gate count, which in actuality can be
only a rough guide for hardware cost. Three of the four cases
previously listed show on this figure as local optima. All net-
works investigated have the property that a p-ordered vector with
a skip distance of 1 found all 512 paths simultaneously on the
first request.

B.5.2 Data

The individual simulation averages in Table B.2 are reported in
more detail in Table B.3. The simulator generated either a
p-ordered vector for the 512 requests, or generated 512 random
numbers. Requests could be queued in the processors. In early
runs, case 1, the two-sheet Omega network, was simulated by com-
bining two successive cycles of requests of a simulation of case
3, the one-sheet Omega. When the original set of requests is a
permutation containing no duplicate EM module numbers, this is
correct for all cycles. However, when the original set of
requests is random, corrections for multiple accesses to the same
EM module must be made and only the first cycle is correctly
simulated for these early results.

Table B.3  Summary of Individual Simulation Runs


The stochastic analyzer data is summarized in Table B.4. The
output of the stochastic analyzer gave a detailed description of
the blockage of the network at each level, giving the probability
of a request being blocked at each level. Only the totals are
shown here. In addition, the number of processors making a request
was varied from 256 (50% of the processors) to 512 (100% of the
processors). The body of the table gives the probability of a
request being blocked within the CN, and hence the fraction of
requests that are expected to be blocked.

In the case of the single Omega, any EM module conflict will
result in blockage within the CN. For the double Omega, up to two
requests for the same EM module may show up at the output port.
The stochastic analyzer does not count this as blockage.

In  addition  to  the  actual  512-processor/521-EM  module  case,  the 
stochastic  analyzer  was  run  for  curiosity  on  a number  of  other 
sizes  of  arrays,  to  investigate  sensitivity  to  the  exact  number  of 
processors  and  EM  modules.  These  are  also  listed. 

All  the  data  of  Table  B.4  is  plotted  in  Figure  B.17. 

B.5.3 Discussion of Simulation Experiments

P-ordered requests had considerable variation in the percentage of
success, in the functional simulator, as a function of the skip
distance p. This contrasts with the behavior of random requests,
whose behavior was nearly uniform and independent of the seed for
the random number generator. Almost all values of p produce p-
ordered vectors whose percent of requests granted is substantially
better than for random requests.

Certain skip distances (including p = 1) are "magic", in that all
EM accesses are attained on the first cycle, with no interference.
Figures B.18 and B.19 show the distribution of percentage success
over the various skip distances tried for two of the networks.
The experiment has a defect: skip distances were not selected at
random, but were partly picked on hunches that said they would be
"magic", with high success rate, or "perverse", with low success
rate. p = 17 was expected to be perverse and it was. p = 228 was
expected to be "magic" (228 is the reciprocal of 16 in modulo 521
arithmetic) but it was not.
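
The statement that 228 is the reciprocal of 16 modulo 521 is easy to
confirm. The following Python lines are only a check of that
arithmetic; the helper name reciprocal_mod is hypothetical, and the
three-argument form of pow with a negative exponent assumes Python
3.8 or later.

```python
EM_MODULES = 521  # number of EM modules, a prime


def reciprocal_mod(value: int, modulus: int = EM_MODULES) -> int:
    """Multiplicative inverse of value modulo modulus (Python 3.8+)."""
    return pow(value, -1, modulus)


if __name__ == "__main__":
    print(reciprocal_mod(16))          # 228
    print((16 * 228) % EM_MODULES)     # 1
```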

In normal operation, processors are not spending full time access-
ing EM, but are spending most of their time doing other things.
Furthermore, since they are processing independently of each
other, processor requests will often get out of synch with similar
requests in other processors. Therefore, a question of interest
is to what degree the blocking in the CN decreases as the
percentage of processors making requests decreases. Two methods
were used to investigate this question. First, the stochastic
analyzer could be run with the probability of a request being
issued from a given processor set to various values. These results
are shown in Figure B.20. These results are for single Omega
(case 3), and double Omega with interlayer data paths (case 4),
plus some non-realistic cases. The second method was simulation
with only some portion of the processors having CN requests. Most
of the results of the second method were obtained with an early
version of the simulator which could not be initialized to fewer
than 512 requests. However, the simulator did keep retrying all
requests that failed to be satisfied on the first EM cycle. Hence,
these leftover requests can be used to estimate the response of
the CN to the situation that only a portion of the processors are
making a request. Figures B.20, B.21 and B.22 are these data for
the double Omega (case 1), the Benes (case 2), and the single
Omega networks (case 3) respectively. Data points derived from
requests left over from p-ordered vectors are marked with X; points
representing leftover requests from originally random requests are
marked with dots; and points marked with circles are cases that
were run with partial random requests after the functional simu-
lator was improved so as to initialize the processor request
queues for an arbitrary number of processors less than 512.

B.5.4  Test  Cases  Abstracted  From  the  Aero  Flow  Codes 

In two directions of accessing, the aero flow codes produce
p-ordered vectors as access requests. In the third ("hard")
direction, a p-q-ordered vector is produced. A full-scale im-
plicit code might have dimensionality (100, 50, 200), leading to
p = 1 and q = 4900 = 211 modulo 521. The explicit code as supplied
(a small-mesh test case) has dimensionality (31, 31, 31), leading
to p = 1 and q = 931 = 410 modulo 521. Several test cases were run
using the double Omega, case 1, which by that time was targeted as
the CN most likely to be recommended, and the results are shown in
Table B.3, where they are called "p-q-ordered with p=1". The first
sheet of simulator printout, on the first CN clock, gives a
printout for the first layer which is identical with the first
clock of a single Omega; hence, this data is also listed in Table
B.3.

B.6  SELECTION AMONG THE CN ALTERNATIVE APPROACHES

Four preferred approaches to Connection Network (CN) implemen-
tation were listed in Section B.5.1. The arguments presented
below show that the double Omega network (case 1 or case 4) is
preferred. Trade-off studies between these two cases are incom-
plete. Table B.5 compares the characteristics of the four prefer-
red cases.

In Table B.5, the "Hardware" column compares the gate count of
data-carrying gates of the various versions with the corresponding
gate count of the Transposition Network considered during the
Preliminary Study [1, 2]. This comparison is used since package
count is subject to uncertainties of packaging, suitable part
availability, etc.

"Random Success" is, for sets of 512 simultaneous random EM access
requests, the average percentage of requests that were serviced on
the first EM cycle.

"P-ordered Success" is the corresponding average percentage for
sets of 512 p-ordered requests. Substantial variation in this
percentage from one set to another was observed, although perfor-
mance was consistently better than it was for random requests.
The percentage given does not include p=1, the simple vectors, for
which the success percentage is always 100%, as the next line
reminds us.

"P-q-ordered success" gives the percent success observed with
so-called "p-q-ordered" vectors, in which the module numbers come
from the set M_i = (i*p + (i DIV k)*q) mod 521. The value of p was
always 1 in the test cases, which come from actual aero-flow
codes.
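
As an illustration of the p-q-ordered definition, the module numbers
can be generated directly from the formula. The sketch below is not
the report's simulator; the function name pq_ordered and its
parameters are hypothetical, and k is taken to be the length of each
p-ordered piece, which is how the formula is read here.

```python
from typing import List

EM_MODULES = 521


def pq_ordered(count: int, p: int, q: int, k: int,
               modules: int = EM_MODULES) -> List[int]:
    """EM module numbers M_i = (i*p + (i DIV k)*q) mod modules, i = 0..count-1.

    k is assumed to be the length of each p-ordered piece, so the
    extra offset q is applied once per piece.
    """
    return [(i * p + (i // k) * q) % modules for i in range(count)]


if __name__ == "__main__":
    # With q = 0 the vector degenerates to a plain p-ordered vector.
    print(pq_ordered(6, p=1, q=0, k=3))
    # Values of q are only significant modulo 521.
    print(4900 % EM_MODULES, 931 % EM_MODULES)
    print(pq_ordered(4, p=1, q=4900, k=2))
```

Because only q modulo 521 matters, a q of 4900 behaves exactly like
a q of 211.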

B.6.1 Discussion of Results

The data in Table B.3 comes from a simulator which makes 512
simultaneous requests of the EM modules. In actual programs, this
is expected to happen only on the first cycle of the DOALL on the
first EM access. Once some processor has been delayed on
accessing EM memory, it will no longer be in synch with the access
requests of other processors, and so the system should be self-
regulating for all but the shortest DOALLs, with an effective
delay controlled by the average access time observed when some
fraction of all the processors are requesting access to memory.

Consider a program that averages five floating point operations
per EM access (for example, the 2D version of the explicit code,
according to the Preliminary Study [1, 2]). Each EM access ties
up the CN for an estimated four CN clocks of 120 ns each, if the
success rate were 100%. Five floating point operations in 512
processors will take at least 1471 ns, so that the CN would have
162 requests pending on the average at any given time. Figure
B.20 shows that the percent success with case 1, the double Omega,
is nearly 100% at this level of loading. Figures B.21 and B.22
show about 80% success at 210 requests loading for case 2, the
Benes network, and 60% success at 270 requests for case 3, the
single Omega network. The 162 requests are 42.4% of maximum
loading for case 1, 210 requests are 65% of the maximum loading
for case 2, and 270 requests are 75% of the maximum loading
for case 3.

None of the cases carries a hardware cost greater than about one
third that of the set of 516 processors. The paragraph above
shows that the simple double Omega (case 1) can handle programs
with as few as five floating point operations per EM access and
still have margin to accommodate bursts of EM access. Such bursts
are planned; we expect a flurry of fetching from EM at the
beginning of many DOALLs, and another shorter burst of stores to EM
is expected at the end of many DOALLs.

The double Omega network with interlayer paths (case 4) is even
better than the case 1 double Omega at not blocking. Unfortun-
ately, modification of the CN simulator to include case 4 would
have been a major effort, and was not done in time for this re-
port. Hence the evaluation of this network is incomplete.

The choice between case 1 and case 4, both double Omega networks,
must take into account a number of other factors if the choice is
to be optimized. Among these are:

* Characteristics of the applications programs. The four
benchmark programs represent only four points in applications
space. The main characteristic of interest here is the number of
EM accesses, and their distribution in time.

* Relative cost of the two versions. The gate count is in
the ratio of 1.4:1, case 4 with interlayer connections having the
more gates. If the CN chips turn out to be strictly pin-limited,
the extra gates may not cost much at all.

* Ease of diagnosing hard failures. In the simple two-layer
network diagnostics are straightforward, since each single Omega
network, tested separately, is easy to diagnose, as shown in
Chapter 6 of this report. More complex hardware controls are
needed to make the more complex version as easy to diagnose.

B.7  ADDITIONAL  CONSIDERATIONS 

The remainder of this appendix considers an assortment of various
behaviors of the CN and aspects of EM accessing. These include
references to or discussions of

* Modular partitioning

* Mapping of module number to CN port number and sparing

* Processor-to-processor transfers

* EM module conflicts for p-q-ordered vectors

* Approximate validity of the assumption of random EM
  module numbers when EM accesses are queued within the
  processors.


Appendix H contains brief discussions of the CN simulator and of
the stochastic analyzer respectively. Listings of the CN
simulator (prior to the insertion of the capability of testing
p-q-ordered requests) and of the stochastic analyzer have been
provided to NASA Ames.

Appendix  I contains  an  analysis  of  the  connectivity  of  various 
networks  which  was  performed  soon  after  CN  considerations  began. 

B.7.1  Modular  Partitioning 

Note the division of an Omega network (Figure B.23 is a 16 x 16
Omega network) into distinct upper and lower halves after the
first level of nodes, and into quarters after the second. It is
expected that after the second level of nodes, identical quarters
can be put into each of the four EM cabinets. Thus, the CN would
not physically exist as a single central item except possibly for
the first two levels of nodes.

B.7.2  Mapping 

Mapping is described in adequate detail in Chapter 5, and need not
take much space here. The probable mappings are as follows:

The CN ports on the processor side are numbered 0 to 1023. The
first seven bits plus the least significant bit will be called CN-
port-within-cabinet. The two intervening bits are the cabinet
number. Within the cabinet, the processors are numbered 0 through
128, including the spare. Processors 0 through 127 are assigned
port numbers as follows: reverse the processor number end for end,
least significant bit to most significant bit position and vice
versa, then multiply this result by 2. The result at this point is
the CN-port-number-within-cabinet. Processor 128 is assigned to
CN-port-number-within-cabinet No. 1. All others have even-num-
bered ports.
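
The processor-side rule can be sketched as follows. This is a
minimal illustration of the within-cabinet mapping only, assuming
the processor number is treated as a 7-bit quantity when it is
reversed; the function names are hypothetical, and the placement of
the cabinet bits in the full 10-bit port number is not modeled.

```python
def reverse_bits(value: int, width: int = 7) -> int:
    """Reverse the low `width` bits of value end for end."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def processor_port_within_cabinet(processor: int) -> int:
    """CN-port-number-within-cabinet for processor 0..128 of one cabinet."""
    if not 0 <= processor <= 128:
        raise ValueError("processor number within cabinet must be 0..128")
    if processor == 128:
        return 1  # the one odd-numbered port
    # Processors 0..127: 7-bit reversal, then multiply by 2 (even ports).
    return 2 * reverse_bits(processor, 7)


if __name__ == "__main__":
    for proc in (0, 1, 2, 64, 127, 128):
        print(proc, processor_port_within_cabinet(proc))
```

Since bit reversal is a permutation of 0 through 127, the 129
processors of a cabinet receive 129 distinct port numbers.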

The CN ports on the EM module side are numbered from 0 to 1023.
There are 525 EM module slots, and hence 525 CN port numbers to
be assigned. EM module numbers 0 through 511 are assigned to the
even port numbers from 0 through 1022 respectively. The
additional 16 slot numbers are assigned four per cabinet as shown
in Equation B.6.

     CN Port No. = 32 x (EM No. modulo 512) + 1  for EM No. > 511.  (B.6)
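
A sketch of the EM-side assignment, following Equation B.6 exactly
as printed, is given below. The function name em_port and the range
check on the 525 slots are assumptions made for the illustration; no
claim is made here about how the resulting ports fall among the
cabinets.

```python
EM_SLOTS = 525  # EM module slots, numbered 0..524


def em_port(em_number: int) -> int:
    """CN port number on the EM side for a given EM module number."""
    if not 0 <= em_number < EM_SLOTS:
        raise ValueError("EM module number must be 0..524")
    if em_number <= 511:
        return 2 * em_number            # even ports 0, 2, ..., 1022
    return 32 * (em_number % 512) + 1   # Equation B.6, odd ports


if __name__ == "__main__":
    for em in (0, 1, 511, 512, 513, 524):
        print(em, em_port(em))
```

Modules 0 through 511 land on even ports and the remaining slots on
odd ports, so no two EM modules share a port.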

On the EM side, the most significant two bits of port number are
the cabinet number.

Sparing of EM modules would be accomplished by replacing a
reference to a failed EM module with a reference to one of the
spares (numbered 521 through 524). The remapping of such a
reference would occur in the CN buffer. The remapping carried out
in the CN buffer would change up to four EM module numbers from
their normal CN destination port number assignments to the CN
destination port number for the spares. Sparing of processors is
done by designating one as spare, whereupon all processors whose
physical location numbers are higher than the physical location of
the spare within the same cabinet interpret physical location
minus one as their processor number.
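
The processor sparing rule amounts to a simple renumbering, sketched
here under the assumption of 129 physical locations (0 through 128)
per cabinet. The function name logical_processor is hypothetical.

```python
from typing import Optional

LOCATIONS_PER_CABINET = 129  # physical locations 0..128


def logical_processor(physical: int, spare: int) -> Optional[int]:
    """Processor number seen by the unit at a physical location.

    The unit designated as spare has no processor number (None);
    units above it in the same cabinet use physical location minus one.
    """
    if not (0 <= physical < LOCATIONS_PER_CABINET
            and 0 <= spare < LOCATIONS_PER_CABINET):
        raise ValueError("locations must be 0..128")
    if physical == spare:
        return None
    return physical - 1 if physical > spare else physical


if __name__ == "__main__":
    print([logical_processor(loc, spare=10) for loc in (9, 10, 11, 128)])
```

Whichever location is designated as spare, the remaining 128 units
take processor numbers 0 through 127 with no gaps.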

B.7.3  Processor  to  Processor  Transfers 

SHIFCN using "wraparound" in the simplest way is effective only
when the processor in physical location 128 in each cabinet is
designated as the spare. The "wraparound" command, as described,
makes connection between two CN ports whose numbers differ only in
one bit position. Although the positions of the bits in a CN port
number are different from the positions of the bits in a processor
physical location, they are the same bits, rearranged (swapped end
for end and shifted by one). Thus, to get bits of processor
number to correspond to bits of CN port number, we must have the
processors in the first 128 out of the 129 physical slots.

Thus, some modification to the simple "wraparound" described in
the previous sections is called for in order to accommodate both
sparing and the SHIFCN instruction. The SHIFCN instruction is not
used anywhere in the aero flow or weather codes except as part of
the SUMALL function. In SUMALL, since the use of SHIFCN is hidden
inside system software, deficiencies of SHIFCN could be avoided by
programming. However, the SWAP function will require either a
solution to the SHIFCN problem exposed above, or else a store to
EM followed by a fetch from recalculated addresses.

B.7.4  EM  Module  Conflicts  on  p-q-ordered  Vectors 

Failure to access all 512 memory words in parallel can be due just
as much to request conflicts, where several processors are trying
to access one memory module, as to CN blockage. Case 3, the
single Omega, has the property that all EM module conflicts are
eliminated by a CN blockage that occurs somewhere within the CN.
These blockages that resolve conflicts should not be blamed on a
CN inadequacy, since even a perfect CN will not eliminate the
original conflicts.

Depth of conflict, or "pileup", is defined as the number of proces-
sors requesting the same EM module on one CN cycle. Pileup is not
to be confused with the queues of requests within the processor,
which could conceivably contain even more requests for the same EM
module, but these would not come to light until some later CN
clock.

P-q-ordered vectors occur frequently in the aero flow codes (in
the "hard" direction). When arrays are placed in Extended Memory
with successive elements in adjacent EM modules and when the pro-
cessors are each accessing an element of a p-q-ordered vector con-
currently with all the other processors, then EM conflicts of the
sort just described can occur. This situation is discussed below.


Table B.6  Worst "p-q-ordered" Cases

Array Dimensions   Pileup     Array Dimensions   Conflict Depth

    20 x 26          20           29 x 18             18
    26 x 40          13           29 x 36             14
    39 x 40          13           34 x 46             16
    42 x 62          13           41 x 89             13
    50 x 73          11           45 x 81             12
    43 x 97          12           49 x 85             11
    34 x 23

Since any array size declaration picked at random is not likely to
be one of the bad cases, and since the bad cases are all smaller
than the problem sizes for which the PMP is targeted, the problem
would appear to be a minor one, of the sort most conveniently han-
dled by having the compiler issue a warning to the programmer when
one of the bad cases is seen. The depth of conflict can never be
more than the number of p-ordered pieces in a p-q-ordered vector,
since the p-ordered pieces never have conflicting access
internally.

In Table B.6, the number of conflicts may be different depending
on the order of M, N. Usually, array dimensions (M, N, X) where M
is less than N have more conflicts than (N, M, X). In the table,
the worse of the two cases is listed.

Figure B.24 shows an example of the pileups that occur when an
adverse p-q-ordered vector is accessed from a smaller number of EM
modules. For this example, the number of modules and the number of
processors are both 11. The vector being fetched is
M_i = (3 + 1*i + 9*(i DIV 3)) modulo 11 for 0 <= i <= 10. The top
portion of the figure shows the address space in these 11 modules,
plotted within the two-dimensional representation based on module
number vs. address within module. The addresses being accessed are
marked with an asterisk. The lower part of Figure B.24 shows the
resulting pileups. In this case, the worst pileups are of depth
three at module numbers five and six.

If q plus length of piece is nearly equal to 521, then the
successive pieces of vector will tend to coincide in the same EM
modules, generating substantial conflicts. When an array has a
dimensionality (M, N, X), and the DOALL is on the first and third
subscript, the result is a p-q-ordered accessing of that array
with p=1 and q+K=M(N-1). All numbers that are close to multiples
of 521 which can be factored into an M and an N that are within a
factor of two of each other were surveyed. The depth of conflict
in the most-accessed EM module for each of these cases was
computed. Pileups can also occur when MN is close to 260 modulo
521. Out of all possible pairs of numbers M, N that lie within
the above range, exactly fifteen pairs generate EM module
conflicts that are 10 deep or more (listed in Table B.6). The
worst case is M, N = 20, 26, which yields a depth of conflict of 20
in six memory modules, and which takes 26 cycles of accessing to
resolve, as shown by simulation.

B.7.5 Non-Randomness

Given a random set of EM module numbers as a request, there will
be conflicts at some of the memory modules. After the first cycle
of satisfying the requests has occurred, the memory module with an
N-way conflict will still have an N-way or (N-1)-way conflict.
Hence, if there is a succession of random requests for memory in
the processors, the leftover requests will tend to bunch up to
some configuration that is worse than a random request. In order
to test this effect, a test case was run with all 512 processors
each having a queue of three random requests. The case 1 double
Omega network was used for simulation purposes.
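
The flavor of this test can be illustrated with a much simplified
model. The sketch below is an assumption-laden stand-in for the
study's simulator, not a reproduction of it: it ignores blocking
inside the CN entirely, assumes 521 EM modules, lets each module
satisfy one request per cycle, and gives priority to the
lowest-numbered processor. All names are illustrative.

```python
import random
from collections import defaultdict


def simulate_queued_requests(num_processors=512, num_modules=521,
                             queue_len=3, seed=0):
    """Simplified bunching experiment: one request served per module per cycle.

    Returns (history, served_total), where history holds one
    (processors requesting, distinct modules named) pair per cycle.
    """
    rng = random.Random(seed)
    queues = [[rng.randrange(num_modules) for _ in range(queue_len)]
              for _ in range(num_processors)]
    history = []
    served_total = 0
    while any(queues):
        # Group the head-of-queue request of every waiting processor by module.
        waiting = defaultdict(list)
        for proc, queue in enumerate(queues):
            if queue:
                waiting[queue[0]].append(proc)
        requesting = sum(len(procs) for procs in waiting.values())
        history.append((requesting, len(waiting)))
        # Each named module satisfies exactly one of its requesters (assumed
        # policy: the lowest-numbered processor wins).
        for procs in waiting.values():
            queues[procs[0]].pop(0)
            served_total += 1
    return history, served_total


history, served_total = simulate_queued_requests()
for cycle, (requesting, modules) in enumerate(history, start=1):
    print(cycle, requesting, modules)
```

Because CN conflicts are not modeled, the cycle count and per-cycle
figures from this sketch should not be expected to match Tables B.7
and B.8; it only shows how leftover requests accumulate on the
contested modules.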

Tables B.7 and B.8 trace the history of this test through the 12
EM cycles that it took to satisfy all processors. For each cycle,
Table B.8 gives the number of processors requesting memory, the
number of memory modules over which such a request is expected to
fall (by ref. 1), the smaller number of memory modules that the
bunched-up requests actually asked for, the number
of memory modules actually reached (any difference between the
second and third is due to conflicts in the CN), the percentage of
non-blocking in the CN, and finally, the length of the longest
pileup observed. Table B.7 gives the history of memory module
conflicts per cycle for this test. Cycle 11 included one processor
that was requesting the second item in the processor's queue
of three items. This lone processor's third item constitutes
cycle twelve.


TABLE B.7 Pileup History
Number of EM Modules with Specific Conflict Depths
Cycle  Depth 1  Depth 2  Depth 3  Depth 4  Depth 5  Depth 6  Depth 7  Depth 8
From Table B.8 one can see that on Cycle 1 there were 512 requests,
and that, if purely random, one should expect these
requests to involve 327.2 EM modules on the average. There were
327 memory modules requested in this first cycle, whose requests come
directly from the random number generator. In subsequent cycles, there
are always slightly fewer EM modules being requested than one
should expect if the processors requesting were issuing
random requests. At cycle 5, there are only 194 different memory
modules in the 282 requests being issued by 282 processors, whereas
if those 282 requests were random, one expects 217.9 different
memory modules to be named. This is the worst bunching of requests
seen in the whole run.
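
The expected number of distinct modules named by a set of random
requests can be estimated with the usual occupancy formula. The
formula actually used in ref. 1 is not reproduced in this appendix,
so the sketch below is an assumption: it takes each request to be
independent and uniform over 521 EM modules, giving an expectation
of m(1 - (1 - 1/m)^n) for n requests over m modules. The function
name is illustrative.

```python
def expected_distinct_modules(num_requests, num_modules=521):
    """Expected count of distinct modules named by independent uniform requests.

    Assumes every request picks one of num_modules with equal probability.
    """
    miss_probability = (1.0 - 1.0 / num_modules) ** num_requests
    return num_modules * (1.0 - miss_probability)


print(round(expected_distinct_modules(282), 1))
```

Under these assumptions, 282 random requests are expected to name
about 217.9 distinct modules, which is the figure quoted for cycle 5.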

Whether these results are statistically significant was not analyzed;
they might be within the normal range of random variations.
Whether significant or not, the indication is that the expected
bunching effect is fairly small.

B.7.6  Redundancy 

The double Omega, case 1, network has the property that either half
can be disconnected from the system under coordinator control.
This feature is provided to increase system availability, since
the double Omega with one of its networks turned off is precisely
the single Omega, case 3, and will support FMP program execution,
but at some increase in effective EM access time.
B.8 CONCLUSION

The study has shown that the double Omega Network (case 1 of this
discussion) can be expected to give the required performance at
reasonable cost. Its performance has been validated by simulation
and analysis. Various options giving either higher performance or
lower cost have also been presented. Additional options were
considered during the course of this study, but were omitted from
this discussion in order to avoid digressions.

Although sufficient study has been completed to give confidence in
the feasibility of the Connection Network in the FMP architecture,
cost/performance trade-offs deserve further consideration.
INSTRUCTION SET AND TIMING INFORMATION

C.1 INTRODUCTION

The  instruction  set  has  undergone  substantial  refinement  since  the 
instruction set of the Preliminary Study [1,2]. Additional functions
have been identified, including the necessity for hardware
double  precision,  a "read  with  lock"  operator  in  Extended  Memory, 
additional  operators  for  the  system  software,  and  so  on.  The 
unsynchronized  CN  has  required  substantial  changes  in  the 
operators  that  access  Extended  Memory,  including  the  addition  of  a 
MOD  521  operator  in  every  processor,  and  the  elimination  of  the  CN 
controls  from  the  coordinator  for  EM  accessing. 

One set of processor instructions is known to be necessary, namely
a set of operators to allow formatting of output, and unformatting
of input. These have yet to be specified. Insofar as the
instruction set presented here still does not have them, it is
incomplete. In evaluating the processor against the benchmark
aero flow codes and weather codes, these character-manipulating
operators are not needed, even though they will be needed in a
final design.

Table C.1 is a listing of the instructions. It is divided into
three sections. Processor instructions are in the first.
Commands issued by the coordinator and effected in the processor
are the second. Coordinator instructions are the third. Since
every processor is a serial scalar processor, and can execute
scalar code on data residing in EM, no separate 513th "scalar"
processor is any longer required, nor are the "scalar unit"
instructions of the baseline system any longer included. Hence, no
floating point instructions are listed for the CR. If floating
point requirements become identified in the system software that
executes on the CR, there will have to be floating point capability
included in the CR.


Table C.2 at the end of this appendix is a list of the timing of
the instructions. The format used is similar to that used in the
Preliminary Study [2], except that the instruction descriptions
have been moved to Table C.1.

C.2 DESCRIPTION OF TABLE C.1

Table C.1 is a description of the complete instruction set. A
buffer register interfaces the CN; the buffer can hold an
address and a word of data for a STOREM, or accept a word of data
from EM on a LOADEM, without interfering with, or requiring
assistance from, whatever instruction is in the processor. Hence,
instructions which access this buffer must be able to test whether
it is "busy", that is, dedicated to an uncompleted LOADEM or STOREM, and
whether or not it is "full". To a large extent, these tests on
"full" and "busy" replace the waiting for "go" in the instruction
set of the baseline system. For example, a STOREM, having told
this buffer to empty itself at the designated memory module, need
not wait for anything more to happen; the next instruction may
start immediately.

The list has been simplified by using condensed notation. A
"(L,M)" following a mnemonic means that either L or M can be
appended to the mnemonic to create other instructions in which the
designated operand comes from memory, or is a literal, instead of
coming from register two. Likewise, instructions with almost identical
descriptions are combined into a single description.

An "F" prefix designates a floating point operation using floating
point registers, "I" designates an integer operation using integer
registers, and "C" designates a coordinator operation, using the
integer registers in the coordinator.

The symbol [illegible] designates concatenation. "Next" designates the
register next after the designated one. Names in quotes are
specific control bits. The symbol [illegible] designates that the data
just described is to be inserted into the location designated just
after.

Major  changes  from  the  Preliminary  Study  [2]  are  listed  in  the 
following  paragraphs. 

Most synchronizations are put onto the CN buffer so that
individual instructions are not held up waiting. "I got here" is
set by one instruction, and then usually tested at some later time
to see if "go" reset it, although WAIT and LOOP still wait for
"go". LOADEM and STOREM with the new CN are completely free of
any synchronization requirements, thanks to the CN buffer.

The instructions by which the coordinator causes diagnostics to be
imposed on the processor are more complete in this list than
previously.

The data path directly from coordinator to processor through
fanout boards, of refs. 1 and 2, has been eliminated. Instead, the
coordinator has been given access to a CN port, which can then be
set to a "broadcast" condition where it connects to all processor
ports in parallel. The control path from coordinator to processors
remains.

Double-precision floating point has been included. Double-length
format is two words in single precision format, with an exponent
difference of 36, and with the second word not necessarily
normalized.

Several corrections, such as incrementing before testing in ITIX
and CTIX, also make these instructions differ from the previous
description.


C.3 MICROPROGRAMMABILITY


Burroughs, on its own funds, has been building an evaluation model
of a processor similar to the single FMP processor (see Appendix E
of ref. 1). This exercise shows that the preferred
implementation, even for a fixed instruction set, will be instruction
decoding by ROM or PROM. Hence, there will be room to modify
the instruction set until fairly late in the design cycle, as long
as the new instructions use the same basic hardware resources as
the defined instructions. Thus, for example, a Newton-Raphson
square root could be included as a microprogrammed instruction,
but the square root algorithm that uses a slight modification of
the divide algorithm would involve a one or two gate change in the
arithmetic chip and could not. Double precision instructions are
microprogrammed from single-precision hardware.


C.4  COORDINATOR  OPERATIONS 

In  all  test  cases  extracted  from  aero  flow  or  weather  codes,  the 
coordinator  has  nothing  to  do  for  long  stretches  of  time,  only  an 
occasional  SYNC  instruction  to  enforce  the  data  precedence 
conditions  at  the  end  of  the  DOALL. 

On  the  other  hand,  the  coordinator  will  have  system  functions  to 
perform,  such  as  responding  to  I/O-complete  interrupts  at  the  end 
of  DBM-EM  transfers.  These  two  functions  are  interlaced  at  the 
same  instruction  execution  station;  "all  processors  ready"  is  an 
interrupt  that  is  allowed  in  system-function  code  execution,  and 
masked  off  in  user  code,  so  that  system  functions  can  be  executed 
during  the  long  waits  in  coordinator  user  code. 

C.5 FORMATS

This instruction set is presented to demonstrate feasibility of
the FMP. Some of the assumptions underlying this instruction set
could conceivably be changed during the actual design of the FMP.
These assumptions include addressable registers, a desire to
sometimes use absolute addresses, and a data word size of 48 bits.
A data word size of 48 bits points to 48 bits and its submultiples
as preferred instruction sizes also. This instruction set assumes
24 and 48-bit instructions. Within 24 bits we get an opcode and 3
register addresses; or an opcode, two register addresses and a
7-or-8-bit count field; or a 4-bit opcode, a register address, and
a 16-bit literal. Within 48 bits we can get two address-sized
fields with or without index designations plus one register
address and an opcode; or one address-sized field and two or three
register addresses.
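
As an illustration of the third 24-bit format (4-bit opcode,
register address, 16-bit literal), the following Python sketch packs
and unpacks such a half-word. The 4-bit register address follows
from 24 - 4 - 16 = 4; the ordering of the fields within the word and
the function names are assumptions made for this example only.

```python
OPCODE_BITS = 4
REG_BITS = 4        # inferred: 24 - 4 - 16
LITERAL_BITS = 16   # total is 24 bits


def pack_short_literal(opcode, reg, literal):
    """Pack (opcode, register, literal) into one 24-bit instruction word.

    Assumed layout, most significant first: opcode | register | literal.
    """
    for value, bits in ((opcode, OPCODE_BITS), (reg, REG_BITS),
                        (literal, LITERAL_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value does not fit in %d bits" % bits)
    return (opcode << (REG_BITS + LITERAL_BITS)) | (reg << LITERAL_BITS) | literal


def unpack_short_literal(word):
    """Inverse of pack_short_literal."""
    literal = word & ((1 << LITERAL_BITS) - 1)
    reg = (word >> LITERAL_BITS) & ((1 << REG_BITS) - 1)
    opcode = word >> (REG_BITS + LITERAL_BITS)
    return opcode, reg, literal


word = pack_short_literal(0xA, 3, 0x1234)
print(hex(word), unpack_short_literal(word))
```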

If, instead, one assumes 16, 32, and 48-bit formats, the register
instructions would largely be two-address, either Reg1 op Reg2 ->
Reg1 or Reg1 op Reg2 -> Accumulator, in order to fit the common
instructions into 16 bits.
C.6 ADDRESSING


The address field (18 bits) consists of either "00" + 16-bit
absolute address, "01" + 16-bit literal, "10" + 4-bit register
identifier + 12 bits offset, or "11" undefined so far.
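
The three defined forms of the 18-bit address field can be decoded
as sketched below. This is a minimal illustration, assuming the
2-bit tag occupies the most significant bits and that the sub-fields
appear in the order listed above; the function name and the returned
tuples are not part of the instruction set description.

```python
def decode_address_field(field):
    """Decode an 18-bit address field into a descriptive tuple.

    Assumed layout: 2-bit tag in the top bits, 16 bits of body below it.
    """
    if not 0 <= field < (1 << 18):
        raise ValueError("address field must fit in 18 bits")
    tag = field >> 16
    body = field & 0xFFFF
    if tag == 0b00:
        return ("absolute", body)
    if tag == 0b01:
        return ("literal", body)
    if tag == 0b10:
        # 4-bit register identifier followed by a 12-bit offset.
        return ("indexed", body >> 12, body & 0xFFF)
    return ("undefined", body)


print(decode_address_field((0b10 << 16) | (5 << 12) | 0x0AB))
```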

Absolute addressing is intended to be used only for system software
and for FORTRAN common. Simple variables and "descriptors"
have relative addresses with respect to the stack pointer, just
as in the B 6700, and 12 bits should be enough. "Descriptors" is
loosely used to refer to base addresses of named common areas and
base addresses of local arrays (or "IN ALL" arrays whose scope is
within the subprogram only).

In  test  cases,  12  bits  was  enough  to  access  any  element  of  any 
local  array.  A base  address  of  a local  array,  once  fetched  to  a 
local  register,  can  be  used  for  several  accesses  to  that  array. 
When  a single  computed  address  is  not  enough  then  the  restriction 
to  only  one  register  that  can  be  added  to  the  offset  creates  some 
additional  integer  arithmetic  that  has  to  be  programmed.  The  test 
cases  show  enough  cases  where  the  programming  consists  of  a single 
integer  add,  as  to  suggest  that  a fourth  address  format  ought  to 
be  ”11”  followed  by  two  four-bit  integer  register  addresses  and  an 
8-bit  address.  The  saving  is  one  of  code  file  size  only,  and  not 
directly  in  execution  time,  since  the  act  of  adding  two  integers 
together  takes  one  clock  whether  those  integers  are  specified  in  a 
separate IADD instruction or specified as indices associated with
an  address  field.  Such  double  indexing  would  add  one  clock  to  the 
beginning  of  any  instruction  in  which  it  occurred.  It  is  not 
included  in  this  description. 

C.7 NUMBER OF INSTRUCTIONS

How  reasonable  is  the  expectation  that  the  opcode  field  will  be  8 
bits?  In  this  list  are  174  processor  instructions,  64 
floating-point-only,  79  integer-only,  and  31  other.  Character  or 
string  operators  still  are  to  be  added.  There  are  100  coordinator 
instructions,  of  which  29  are  for  system  and  diagnostic  actions. 
Some  instructions  occur  very  frequently,  and  it  is  worth 
shortening  the  opcode  to  pack  them  into  a smaller  word.  For 
example,  IMOVEL  and  IJUMP  are  candidates  for  being  24-bit 
instructions.  If  they  are,  then  their  opcode  is  only  4 bits  long, 
and  they  each  occupy  16  of  the  256  slots  in  an  8-bit  opcode  space. 
It is not possible to have a floating-vs.-integer bit in the
opcode, and a half-word vs. full-word bit too, leaving 64 instructions
in each category.
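
The opcode-space arithmetic in this section can be spelled out as
follows. This is only a bookkeeping sketch in Python; the
interpretation that two dedicated opcode bits would split the 256
slots into four categories of 64 is one reading of the paragraph
above, and the variable names are illustrative.

```python
FULL_OPCODE_BITS = 8
OPCODE_SPACE = 1 << FULL_OPCODE_BITS          # 256 slots


def slots_used(short_opcode_bits, full_opcode_bits=FULL_OPCODE_BITS):
    """Slots of the full opcode space consumed by one shortened opcode."""
    return 1 << (full_opcode_bits - short_opcode_bits)


processor_counts = {"floating-point-only": 64, "integer-only": 79, "other": 31}
total_processor = sum(processor_counts.values())

# IMOVEL and IJUMP as 24-bit instructions with 4-bit opcodes.
short_slots = 2 * slots_used(4)
remaining_slots = OPCODE_SPACE - short_slots

# Two dedicated bits (floating/integer, half/full word) leave 6 opcode bits.
per_category = OPCODE_SPACE // 4

print(total_processor, short_slots, remaining_slots, per_category)
```

With 79 integer-only instructions and only 64 slots per category,
the integer group alone would not fit such a scheme.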

C.8  INSTRUCTION  EXECUTION  TIMING 

Timings are given in Table C.2. For the processor instructions
there are four separate functional units involved. In each of the
first three units, an instruction either has a starting time and an
ending time or does not use that unit. The time of execution of
each instruction is dependent on its time of occupancy (if any) in
each of the first three independent execution units, namely:
integer unit, floating point unit, and memory controls. The
timing is described most easily with respect to the instruction
fetching process, which determines the starting time of each
successive instruction. The fourth functional unit, the CN buffer,
allows EM fetches and stores to transpire in parallel with other
processing. It executes independently, once started, and does not
affect the starting of the next instruction, but may affect the
starting of the next instruction to use the CN buffer.

Entries in the table have the following significance:

"No. of clock periods" is the number of clocks from when the
instruction normally issues to a functional unit, to the
termination of the instruction. The instruction will always have
been decoded from out of the staging register for at least one
clock prior to this.

"Unit busy" is of the form n-m, where n is the number of the
latest clock that the previous instruction is allowed to occupy this
unit, and m is the last clock that the current instruction
occupies this unit.

Some instructions stop the instruction fetching process for a
while, until the coordinator or CN buffer restarts it. The clock
times given for these instructions represent the time from first
decoding such an instruction in the staging register, until the
start of decoding of the next instruction, under the most
favorable circumstances. These are WAIT, STOP, HELP, and any
instruction using the CN buffer.

C.8.1  Instruction  Fetch  Timing 

Timing of the instruction fetching mechanisms can be seen with
respect to Figure C.1. The next instruction is being held in a
staging register. Out of the staging register are decoded the
start times required for the functional units if this instruction
were to start at this clock, and the time it will occupy the
holding register. Also decoded are CN buffer requirements. Out
of the integer, the floating point, and the memory control
functional units is decoded the ending time associated with the
currently executing instruction. Out of the CN buffer are the "I
got here", "busy" and "full" conditions. The "scoreboard"
compares all inputs. When all comparisons say the next
instruction will not interfere with current instructions, the
instruction is transferred from the staging register to the one or
more functional unit instruction registers. If delayed starts in
other functional units are part of this instruction, the
instruction is passed to the holding register to free the staging
register for the next instruction.
Figure C.1 Instruction Fetching Mechanism


The program counter always points to the next word in memory after
the staging register contents. Thus, normally the PM will be
holding the next instruction word statically at its output lines.
Only when the staging register is unloaded in less than three
clocks (the PM cycle) or PM is accessing data will the next word
not appear.

A complexity is the existence of half-word and full-word
instructions. Second halves of instruction words carry the next
half-word instruction, so full-word instructions may only have
their first half present in the staging register. The first half
is sufficient to determine the timing. However, the second half
will contain any memory addresses, so when a fetch from memory is
involved, the second half must also be fetched before the memory
part of the operation can start.

Those instructions which contain a memory address (either for data
or as a branch address), or a literal, are full-word 48-bit
instructions. Others are 24 bits. FL, floating literal, is one
and a half words.

The  arithmetic  timings  assume  perfect  rounding  on  single  length 
floating  point  operations,  but  that  the  excess  precision  makes 
rounding  unnecessary  on  double  length  operations. 

Instructions labelled "branch" will cause all lookahead to hold up
until the direction that the branch takes is determined. Branches
defeat overlap. If the branch is taken, there will be an additional
five clocks, three for fetching the instruction and two for filling
the instruction lookahead mechanism, before the instruction
after the branch can start executing.

An  alternative  method  of  providing  branching  capability  is  to 
separate  the  testing  operation,  which  sets  one  or  more  result 
bits,  and  the  branch  instructions,  which  test  those  bits.  This 
method  has  the  advantage  that  one  can  define  a scheme  for  having 
lookahead  fetch  instructions  along  the  branched-to  path,  rather 
than  in  the  fall-through  direction.  Since  branches  are  usually 
taken,  some  slight  improvement  in  performance  would  accrue,  in 
addition  to  which  the  instruction  set  becomes  somewhat  simplified. 

The instructions CLOADEMN and CSTOREMN are assumed to be
implemented as a microprogrammed sequence of successive single EM
accesses. These could be substantially speeded up if hardware
were added so that the EM module could recognize these commands as
different from single LOADEM and STOREM, and keep the CN path
locked up.

The CN clock frequency is one third of the main
clock frequency. With the main clock period 40 ns, the CN clock
period is 120 ns. All passing of addresses and data through the CN
will be synchronized with this CN clock. Thus, the 6, 9, 12, or 15
clocks taken by instructions that pass data through the CN are
actually 2, 3, 4, or 5 CN clocks. Operations involving the CN
buffer only, such as loading its registers or testing its
flipflops, can be done on any processor clock, and are not locked
to any one of the three phases of the CN clock. For example, the
instruction FSTOREM takes three clocks to load address and data
into the CN buffer if it finds it free. These clocks do not
depend on CN clock phase. However, the minimum of 6 clocks that
the CN buffer is busy involves sending data to EM, and can be the
minimum of 6 only if the STOREM loads the data into the CN buffer
at the proper CN clock phase.
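
The relation between processor clocks and CN clocks can be made
concrete with a short Python sketch. It assumes, purely for
illustration, that CN clock edges fall on processor clock counts
that are multiples of three; the function names are not taken from
the instruction set.

```python
MAIN_CLOCK_NS = 40
CLOCKS_PER_CN_CLOCK = 3
CN_CLOCK_NS = MAIN_CLOCK_NS * CLOCKS_PER_CN_CLOCK   # 120 ns


def cn_clocks(processor_clocks):
    """Convert a CN transfer time in processor clocks to whole CN clocks."""
    if processor_clocks % CLOCKS_PER_CN_CLOCK:
        raise ValueError("CN transfers take a whole number of CN clocks")
    return processor_clocks // CLOCKS_PER_CN_CLOCK


def clocks_until_cn_edge(processor_clock):
    """Processor clocks lost waiting for the next CN clock edge.

    Assumes CN edges coincide with processor clock counts 0, 3, 6, ...
    """
    return (-processor_clock) % CLOCKS_PER_CN_CLOCK


print([cn_clocks(c) for c in (6, 9, 12, 15)])
print([clocks_until_cn_edge(t) for t in range(6)])
```

A store that finishes loading the CN buffer one processor clock after
a CN edge would, under this assumption, wait two further processor
clocks before the transfer to EM can begin, which is why the
6-clock minimum is reached only at the proper phase.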

For an example of these timing rules applied, see Reference 2.

C.8.2 Coordinator Timing

The coordinator has a similar set of independent units. There is
an arithmetic unit similar to the processor's integer unit. There
is a memory control unit. For accessing EM, there is a CN buffer
unit identical to those found in each processor. The coordinator
also has access to a port on the EM side of the CN, from which it
can broadcast data to all processors, and "harvest" data from all
processors. This second port is part of the arithmetic unit, for
timing, and the compiler will ensure that the CN is idle whenever
the instructions that use this port, mostly the instructions that
are included for diagnostics, are used. These are the
instructions from BDCST through READPM in Table C.2. Although
they use the CN, they do not use the CN buffer.

The  diagnostic  controller  is  not  used  during  normal  program 
running.  It  is  used  only  for  diagnostics  and  system 
initialization.  Hence,  diagnostic  controller  information  is  not 
required to generate timing information about user programs.

C.8.3 Synchronization

Synchronization enters into the timing analysis in two ways.
First, the instructions that use the CN buffer may test to see
whether "I got here" is up, and may test whether the CN buffer is
"full", "busy" or neither. The actual tests required are listed
in the descriptions of the individual instructions. These
instructions then wait until the CN buffer takes on the
appropriate state before continuing. Some of these instructions
leave the CN buffer with an unexecuted command, such as STOREM,
which will be "busy" until the address and data have been
successfully emptied into an EM module, or LOADEM, which will be
"busy" until data comes back from EM to make it "full". The
processor will be free to go on executing any instruction except
those which depend on the CN buffer having gotten to the new
state. Some CN buffer states require action on the part of the
coordinator. For example, only after all processors execute
EMFILL can the coordinator execute the HVST instructions. Only
after all processors execute EMREQ can the coordinator execute the
corresponding FETCHEM or BDCST. Only after the coordinator has
executed FETCHEM or BDCST can the processor execute the REM
instruction that accepts the broadcast data.


The second synchronization method involves single processor
instructions such as WAIT. The processor checks to see if "I got
here" is down from any previous case. If not, it waits for "go"
to come from the coordinator to reset "I got here". Then the
processor raises "I got here", and waits for "go" before fetching
the next instruction.


C.8.4  Exceptional  Cases 


Within the processor, all fault cases result in an interrupt to
system software that is resident in the processor, estimated at
less than 1K words. It is possible to handle some interrupts
without interrupting the CR. Floating-point out-of-range detection
does not cause interrupts, but results in setting the
floating-point variables to "infinity" or "infinitesimal". Any
integer overflow causes an interrupt, on the theory that most
integer operations are address calculations and overflow represents
a faulty address. Attempting to insert a number outside the
range ±(2^15 - 1) into a 16-bit integer register causes an integer
interrupt; likewise, executing a FIXD (double-length integer) on a
number outside the range ±(2^[exponent illegible] - 1) results in an
interrupt. Any detection of error in the error-detection-correction
logic results in processor interrupt. When the error is correctable, the
interrupt merely logs its occurrence and returns to user
processing within a few microseconds.


The processor enters interrupt mode whenever any bit of the interrupt
register, not disabled by the corresponding bit of the mask
register, is set. The "interrupt" mode flipflop is visible to the
coordinator, which can interrogate whether any processor is in
interrupt mode. One of the bits of the coordinator interrupt
register is the "all processors ready" signal, thereby allowing
the coordinator to perform system software functions during its
long waits in user program.
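
The rule for entering interrupt mode amounts to a bitwise test of
the interrupt register against the mask register. The sketch below
is in Python and assumes that a mask bit of 1 means "disabled"; the
polarity of the mask, the register widths, and the function name are
assumptions made for illustration.

```python
def in_interrupt_mode(interrupt_reg, mask_reg):
    """True if any interrupt bit is set and not disabled by its mask bit.

    Assumes a mask bit of 1 disables the corresponding interrupt bit.
    """
    return (interrupt_reg & ~mask_reg) != 0


print(in_interrupt_mode(0b0100, 0b0100))  # set but masked off
print(in_interrupt_mode(0b0100, 0b0011))  # set and not masked
```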


Note that there are two lines from the processor to the coordinator
that can be called "interrupt" lines. The processor HELP
instruction raises an "interrupt" line that sets the "processor
interrupting" bit in the coordinator's interrupt register. The
"processing interrupt" mode of each processor can be interrogated
by the PINT instruction of the coordinator. In one case the
intent is for the processor to interrupt the coordinator; in the
other, the processor has been interrupted.


C.9  INTERRUPTS 


Both coordinator and processor have an interrupt register.
Processor interrupts are to processor-resident software. After logging
recoverable errors, processor software will return to user
processing within a few microseconds. For non-recoverable errors,
processor software issues an interrupt to the coordinator in order
to shut down the entire FMP. In the processor, the list of
interrupts is (with recoverable interrupts identified):




Single  error  corrected  in  processor  memory  (recoverable) 

Double  error  detected  in  processor  memory 

Single error corrected in word received from CN buffer (recoverable)

Double error detected in word received from CN buffer

Parity  error  in  microprogram  word 

Memory  bounds  error 

Uninitialized  word  fetched  from  EM 

Unnormalized  floating  point  operand  detected 

Integer  overflow 

Divide  by  zero  integer 

Divide  by  zero  floating  point 

Error  detected  in  logic  operation  of  EO 

Software  generated  interrupt  (set  by  ICALLI)  (recoverable) 
Illegal  Op  Code 

Floating point overflow and underflow are caught by changing the
word to "unrepresentable" (or loosely, "infinity") and
"infinitesimal". Divide by floating point zero also results in
"unrepresentable", so for some purposes this interrupt would be masked off
as redundant. There is a control bit which determines whether
integer underflow results in infinitesimal or zero. The single
error corrections are serviced by a routine resident in the
processor which logs their occurrence. Return is to the user
program. Most other interrupts will result in program termination.
It is the design intent to save the memory address and the
corrected bit number for error corrections and the memory address
of double error detections.

In  the  coordinator,  the  interrupt  register  has  the  following  bits: 
EM module error
EM module parity error, data in (address)

Single error detected in coordinator memory
Double error detected in coordinator memory
Single error detected in word received from CN buffer
Double error detected in word received from CN buffer
Parity error in microprogram word
Processor interrupt (sent by processor)

All  processors  ready  (interrupts  system  software  to  get  user 
program's  instruction  executed) 

Memory  bounds  in  coordinator  memory 

Illegal  opcode 

Memory  bounds  in  EM 

Support  processor  interrupt 

DBM  result  descriptor  ready 

Diagnostic  controller  interrupt 

Timeout, no instruction executed for the last X ms

Interval  timer  count  down  to  zero 
Integer  overflow 
Divide  by  zero 

Logic  error  detected  in  coordinator  operation 

DBM  controller  error  detected 

Software  generated  interrupt  (set  by  CCALLI) 

Unrecoverable interrupts enter interrupt processing at address 0.
Recoverable interrupts (single error corrections and ICALLI in the
processor; single error, CCALLI, interval timer, and support
processor interrupt in the coordinator) enter interrupt processing
at a second, hard-wired address. In the coordinator, the "all
processors ready" interrupt has its own hard-wired address.
Processor interrupts interrupt to processor-resident software;
coordinator interrupts interrupt to coordinator-resident software.
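
The choice of entry address for processor interrupts can be sketched
as follows. This is only an illustration: the text gives address 0 for
unrecoverable interrupts but does not state the value of the second
hard-wired address, so RECOVERABLE_ENTRY is a hypothetical
placeholder, and the enumeration lists only a few of the interrupt
causes named above.

```python
from enum import Enum, auto

UNRECOVERABLE_ENTRY = 0   # stated in the text
RECOVERABLE_ENTRY = 1     # hypothetical value; the text gives none


class ProcessorInterrupt(Enum):
    PM_SINGLE_ERROR_CORRECTED = auto()
    PM_DOUBLE_ERROR_DETECTED = auto()
    CN_SINGLE_ERROR_CORRECTED = auto()
    CN_DOUBLE_ERROR_DETECTED = auto()
    SOFTWARE_ICALLI = auto()
    ILLEGAL_OP_CODE = auto()


# The causes the text marks "(recoverable)" for the processor.
RECOVERABLE = frozenset({
    ProcessorInterrupt.PM_SINGLE_ERROR_CORRECTED,
    ProcessorInterrupt.CN_SINGLE_ERROR_CORRECTED,
    ProcessorInterrupt.SOFTWARE_ICALLI,
})


def entry_address(cause: ProcessorInterrupt) -> int:
    """Return the address at which interrupt processing is entered."""
    if cause in RECOVERABLE:
        return RECOVERABLE_ENTRY
    return UNRECOVERABLE_ENTRY
```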


C.10 SUBROUTINE ENTRY AND RETURN
A description  of  how  this  is  done. 
Environment 


Two  integer  registers  are  permanently  designated  as 


SB  The  pointer  to  the  "base"  of  the  address  space  for  the 
current  subroutine 

SL  the  pointer  to  the  limit  of  this  address  space 
(actually  points  to  the  first  word  beyond  the 
allocated  space). 

"S"  stands  for  "stack",  since  space  is  allocated  as  a stack. 

Fig. C.2 shows this stack.


C.10.1  Subroutine  Entry 

Prior to the call, the variables temporarily held in registers
must be stored back in PM if there is any chance the called
subroutine will reference them. Registers that the called
subroutine will use must also be saved. The compiler simply stores
everything back to its "home" address in PM.

At the place pointed to by SL, the caller next writes any
parameters passed by value (where this is allowed in our FORTRAN),
the base addresses of any arrays being passed, and the descriptors
of any named common areas. There are P words in this area, where
P is known to the compiler.

Next, the CALL instruction is executed. It does the following:

1.  The  content  of  SB,  SL,  and  program  counter  are 
concatenated  and  written  into  address  P+SL. 

2.  Register  SB  is  loaded  by  SL  + P 

3.  Register  SL  is  loaded  by  the  new  value  of  SB  plus  a 
literal,  the  space  allocation  known  to  the  compiler. 

CALL therefore has two parameters: the number of parameters
passed, and the amount of space allocated. In ANSI FORTRAN 77, both
of these would be literal fields in the instruction. For some of
the dynamic array sizes that are allowed in FMP FORTRAN, it will
be necessary to insert code to compute the size, and leave it in
an integer register. The absolute program address is computed
from the content of the branch address field and inserted into PCR
for fetching the next instruction.

C.10.2  Subroutine  Return 


RETURN  executes  as  follows: 

1.  Fetch  the  word  addressed  by  SB 

2.  Unpack  that  word  into  SB,  SL,  and  the  program  counter. 
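
The CALL and RETURN steps can be traced with a small simulation. It
is a sketch under stated assumptions, not the hardware definition:
memory is modeled as a Python list, the concatenated SB/SL/PC word is
stored as a tuple rather than packed bit fields, and the names
Machine, call, and ret are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    mem: list = field(default_factory=lambda: [None] * 64)
    sb: int = 0    # base of the current subroutine's address space
    sl: int = 8    # first word beyond the allocated space
    pc: int = 0    # program counter (assumed already advanced)

    def call(self, params, space, target):
        p = len(params)
        # The caller writes the P parameter words starting at SL.
        self.mem[self.sl:self.sl + p] = params
        # Step 1: SB, SL and PC are saved in the word at address P+SL.
        link = self.sl + p
        self.mem[link] = (self.sb, self.sl, self.pc)
        # Step 2: SB is loaded with SL + P.
        self.sb = link
        # Step 3: SL is loaded with the new SB plus the space allocation.
        self.sl = self.sb + space
        self.pc = target

    def ret(self):
        # Fetch the word addressed by SB and unpack it.
        self.sb, self.sl, self.pc = self.mem[self.sb]
```

After a call, the parameters sit at negative offsets from SB and the
working space at positive offsets, which matches the addressing
described in C.10.3.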


[Figure C.3 Stack Allocation in the Data Area]

[Figure C.4 Organization of Named Common: the descriptors for
/A/, /B/, and /C/ point to their respective common areas; the
second stack limit is marked.]


If the subroutine is a function, the results of that function will
be left in a single-length or double-length integer or
floating-point register as appropriate. The register is
determined by convention, and is the same for all functions of the
same type.

C.10.3  Within  the  Subroutine 

Working space is addressed by positive offsets relative to SB. We
have 12 bits of address that may be added to an integer register
as part of the normal addressing machinery. When 12 bits is not
enough, the compiler will have to use integer instructions to
build the address.

Parameters and base addresses of named common areas are accessed
by negative offsets from SB, as implied in the description of
entering a subroutine in C.10.1.

C.10.4  Addressing 

With  the  above  structure,  absolute  addresses  may  be  used  for 
simple  variables  in  the  main  program  and  for  blank  COMMON. 
Varying  degrees  of  indirection  are  implied,  the  most  complicated 
case  being  an  element  of  an  array  in  a named  common  in  a 
subroutine,  where  an  offset  from  SB  is  used  to  find  the  base 
address  of  the  named  common,  an  offset  from  that  base  is  the  base 
of  the  array,  and  the  element  is  offset  from  the  array  beginning. 
(A  smart  compiler  may  combine  the  last  two  into  a single  offset 
and  will  fetch  the  base  address  of  a named  common  to  an  integer 
register  upon  its  first  use.) 

C.10.5  Named  Common  Mechanism 

A second stack of space is allocated to all the named commons. If
the first stack grows by increasing addresses, the second stack
may grow by decreasing addresses. For example, see Figure C.3.
At address zero of each named common is a count of how many
subroutines are currently active which name that common. Each CALL
goes through the descriptors in the parameter area and increases
each count. Each RETURN goes through the descriptors in the
parameter area and decreases each count. A named common used in all
subroutines therefore lasts the entire run, as does blank common.
The words at address zero also contain the size of the named
common, so that they form a relatively-addressed linked list to
each other. See Fig. C.4. Whenever the count goes to zero at
the last named common, the stack limit of the second stack is
decreased to the first nonzero count.

In ALGOL, where addressing environments are nested in lexic
levels, the above mechanism always releases space upon the exit
from the last lexic level that needs that space. In FORTRAN, if
we adopt the above mechanism, it is possible to undefine a block
of space inside this second stack, but it won't be released until
the spaces "above" it in the second stack are themselves released.

A named common disappears whenever no subroutine owning it is
active. A named common descriptor will either be found in the
calling subroutine, upon subroutine entry, or must be created.
Thus, the presence of the appropriately named descriptor in the
calling subroutine causes the descriptor to be copied, while the
absence of an appropriately named descriptor causes new space to
be allocated, and a new descriptor to be created.
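
The counting scheme for named commons can be sketched as below. The
sketch makes several simplifying assumptions: the second stack is a
Python list whose last element is the most recently allocated common,
descriptors are dictionaries mapping a common's name to its block,
and the names Block, SecondStack, enter, and leave are illustrative
rather than taken from the design.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    size: int
    count: int = 0   # active subroutines naming this common


class SecondStack:
    def __init__(self):
        self.blocks = []   # last element is the top of the second stack

    def enter(self, caller_descs, commons):
        """CALL: copy descriptors found in the caller, create the rest."""
        descs = {}
        for name, size in commons.items():
            block = caller_descs.get(name)
            if block is None:            # no descriptor in the caller
                block = Block(name, size)
                self.blocks.append(block)
            block.count += 1
            descs[name] = block
        return descs

    def leave(self, descs):
        """RETURN: decrease each count, then lower the stack limit."""
        for block in descs.values():
            block.count -= 1
        # Release space down to the first nonzero count.
        while self.blocks and self.blocks[-1].count == 0:
            self.blocks.pop()

    def words_in_use(self):
        return sum(block.size for block in self.blocks)
```

A main program naming /A/ that calls a subroutine naming /A/ and /B/
shares one block for /A/, and /B/ is released when the subroutine
returns.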

A provision for statically allocated common areas, that survive
for the life of the job, can easily be made if desired. They have
been omitted from this description because such statically
allocated variables occupy needed space during times that they are
inactive, and because such static allocation, outside of blank
common, is not needed for compliance with FORTRAN 77. In the 3D
implicit code, as explained in Appendix A, the maximum mesh size
would be smaller if all variables were statically allocated.

C.10.6 Arithmetic Details

The design intent is to provide perfect rounding. A floating
point number is a discretized representation of an assumed
underlying real number. When two floating point numbers are combined,
the result is to be the closest representation possible of the
real number result from combining the two underlying real numbers.
Thus, whenever the guard bits are less than one half a least
significant bit, the surviving part of the mantissa shall be left
alone. When they are more than one half a least significant bit,
one is to add 1 to the surviving part of the mantissa. In the FMP
processor, a full double length accumulator cannot be justified.
Therefore, when the eight guard bits are exactly one ONE followed
by ZEROS, they may represent one ONE followed by seven ZEROS,
followed by additional unknown bits, hence we round by adding 1 in
the least significant place whenever the most significant guard
bit is ONE. Alignment of addends is done in one clock, with a
barrel, hence the implementation of a "sticky" bit represents
substantial hardware investment.
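
The rounding rule just stated (add 1 in the least significant place
whenever the most significant guard bit is ONE) can be written out as
a short sketch. The integer representation, with the mantissa
followed by its guard bits in one unsigned value, and the function
name round_mantissa are assumptions made for illustration.

```python
GUARD_BITS = 8  # the text speaks of eight guard bits here


def round_mantissa(extended: int, guard_bits: int = GUARD_BITS) -> int:
    """Round an unsigned value holding mantissa bits followed by guard bits.

    Returns the surviving mantissa, incremented by one when the most
    significant guard bit is ONE. Mantissa overflow is not handled here.
    """
    kept = extended >> guard_bits
    msb_guard = (extended >> (guard_bits - 1)) & 1
    return kept + msb_guard
```

Note that a guard field of exactly one ONE followed by ZEROS rounds
up, as the paragraph above explains.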

Guard bits and rounding are used to preserve precision in
single-length arithmetic (36 bits of precision), but not in double
length (72 bits of precision), giving roughly 11 decimal digits and
21 decimal digits of precision respectively.

Rounding  occurs  after  normalization.  Since  we  have  six-or-eight 
guard  bits,  rounding  is  a no-op  when  normalization  requires  a left 
shift of more than six-or-eight places. The guard bits, shifted
into  the  result  by  the  normalization,  protect  precision 
effectively. 

Rounding after addition can be simplified by observing that
whenever mantissa overflow occurs after the rounding, the resulting
mantissa must be .100000... However, we have to add one to the
exponent. Since the exponent adder is not otherwise busy during
rounding, we have exponent in the result register and exponent +1
being presented at the output of the exponent adder, so that, if
rounding overflows, exponent +1 is loaded into the result exponent
field, while .1000000000 is loaded into the mantissa, all without
requiring any additional clock.
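
A minimal sketch of this overflow case follows. It assumes a binary
exponent and a 36-bit single-length mantissa held as an unsigned
integer; the function name round_result and its argument layout are
illustrative, not part of the design.

```python
def round_result(mantissa: int, exponent: int, msb_guard: int,
                 mantissa_bits: int = 36):
    """Apply the rounding increment and handle mantissa overflow.

    mantissa is a normalized unsigned value of mantissa_bits bits;
    msb_guard is 0 or 1. Returns (mantissa, exponent).
    """
    rounded = mantissa + msb_guard
    if rounded >> mantissa_bits:
        # Overflow can only come from an all-ones mantissa, so the
        # result is .1000... with the exponent increased by one.
        return 1 << (mantissa_bits - 1), exponent + 1
    return rounded, exponent
```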

A zero  result  that  may  get  rounded  away  from  zero  is  a special 
case.  The  sign  and  exponent  of  an  apparently  zero  result  must  be 
saved until after rounding, to accommodate the case that the
result  will  be  rounded  away  from  zero.  All  zeroes  have  positive 
sign  and  the  smallest  allowable  exponent. 

In multiply, normalization and rounding are done together in one
clock. A product never overflows, and normalization is by either
no or one place. Add 1/4 of a LSB to the product on the last cycle
of multiply using the carry input to the second guard bit. Thus,
if normalization by one place is required, the product is already
rounded. If normalization is not required, add another 1/4 of a
LSB. Only 2 guard bits are needed at the end of multiply
(although more are needed to keep the partial products honest during
the formation of the final product). Thus normalization and
rounding take one clock altogether. At that last clock: normalize
the already-rounded product if the leading bit is ZERO;
select the output of the adder (Result + 1/4 (LSB)) if the leading
bit of mantissa is ONE.
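
The multiply rounding scheme can be checked against ordinary
round-half-up with a small model. The model is a sketch under
assumptions: mantissas are n-bit unsigned integers with the leading
bit set, n defaults to 8 only to keep an exhaustive check cheap, the
exact product is simply truncated to two guard bits, and the function
names are illustrative.

```python
def multiply_round(a: int, b: int, n: int = 8):
    """Round a mantissa product using the 1/4-LSB trick.

    a and b are normalized n-bit mantissas (value = a / 2**n).
    Returns (mantissa, exponent_adjust), exponent_adjust being 0 or -1.
    """
    p = (a * b) >> (n - 2)        # n mantissa bits plus 2 guard bits
    p += 1                        # 1/4 LSB: carry into second guard bit
    if (p >> (n + 1)) & 1 == 0:   # leading bit ZERO: normalize one place
        return p >> 1, -1         # product is already rounded
    p += 1                        # leading bit ONE: another 1/4 LSB
    return p >> 2, 0


def reference_round(a: int, b: int, n: int = 8):
    """Round-half-up of the exact product, for comparison."""
    exact = a * b
    if exact >> (2 * n - 1):                       # product >= 1/2
        return (exact + (1 << (n - 1))) >> n, 0
    m = (exact + (1 << (n - 2))) >> (n - 1)        # normalized one place
    if m >> n:                                     # rounded up to 1.0
        return 1 << (n - 1), 0
    return m, -1
```

In this model the two functions agree for every pair of normalized
8-bit mantissas, including products just below one half that round up
to .1000.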

C.10.7  Other  Instructions 

Some operations are implemented as simple by-products of the
instructions in Table C.1. By-product instructions include:

Convert from single-precision to double-precision. Given by
FADDXL, literal = zero.

Convert from double-precision to single-precision. Address
the first half only of a double-precision word.

Divide (multiply) integer by power of two. ISHN(L).

Extract fraction-part from floating-point word. FMOVEXL,
literal = zero. Useful in mathematical functions.

Half-word and full-word No-ops.

TABLE C.1

Processor Instructions

Floating

FADD(M,L), FSUB(M,L)
    Reg1 plus (or minus) operand -> Reg3

FMUL(M,L)
    Reg1 times operand -> Reg3

FDIV(M,L)
    Reg1 divided by operand -> Reg3

FDVR(M,L)
    Operand divided by Reg1 -> Reg3

FMAD(M,L), FSUB(M,L)
    Reg1 times operand is added to (subtracted from)
    Reg3 -> Reg3

FSSQ(M)
    Reg1 squared plus operand squared -> Reg3

FADEX(L)
    Add operand (if literal, may be limited to 8-bit literal in
    countfield) to exponent field in fl. pt. reg. If operand
    is in register, it will be an integer register.

FMOVEX(L)
    Transfer operand (from int. reg., or literal) to exponent
    field in fl. pt. reg.

    (In both the above instructions, the operand is in integer
    format and will be converted to floating point exponent
    format during the course of the instruction.)

FABS(M)
    |operand| -> Reg

FNEG(M)
    minus the operand -> Reg

FADDX(M,L), FSUBX(M,L)
    Reg1 plus (minus) operand -> Reg3 & next (double-length)

FMULS
    Double length product of Reg1 and Reg2 -> Reg3 & next

FADDD, FSUBD
    Double length sum (difference), Reg1 & next plus (minus)
    Reg2 & next -> Reg3 & next

FMULD
    Double length product of two double-length operands. Reg1
    & next * Reg2 & next -> Reg3 & next

FL
    The 48 bits following this opcode -> Reg3

FMOVE(M,L)
    Operand -> Reg3

FPAKM
    Most sig. 24 bits of Reg1 & most sig. 24 bits of
    Reg2 -> memory

FUPKM
    From memory, the most sig. 24 bits of memory word & 24
    zeroes -> Reg1; the least sig. 24 bits of memory word
    & 24 zeroes -> Reg2.

FSTORE
    Reg -> memory.

FIX
    Convert operand in fl. pt. Reg. to nearest rounded
    integer value -> int. Reg.

FIXD
    Convert operand in fl. pt. reg. to nearest rounded integer
    value -> int. Reg & int. Reg+1

FIXF
    Convert operand in fl. pt. Reg1 to integer whose absolute
    value is the largest possible but not larger than the
    absolute value of the original operand. Result -> int.
    Reg2. (Floor)

FIXC
    Convert operand in fl. pt. Reg1 to integer whose absolute
    value is the smallest possible not smaller than the
    absolute value of the original operand. Result -> int.
    Reg2. (Ceiling)

FINFLZ
    If Reg1 contains "infinitesimal", zero -> Reg1

FIXEX
    Convert exponent in fl. pt. reg. to integer format -> int.
    Reg.

FMP
    Convert content of fl. pt. reg. to floating point format
    used by the B-7800. (Will be microprogrammed, and will
    use logic in the integer unit.) If "unrepresentable" or
    exponent out of range, interrupt.

FMTI
    Content of fl. pt. register is assumed to be in B-7800
    floating point format, and is converted to internal FMP
    floating point format.

FLOAT
    Convert integer in int. Reg. to fl. pt. format -> fl.
    pt. Reg.

FLT(M,L), FLE(M,L), FGT(M,L), FGE(M,L)
    If Reg1 tests LT (or LE, GT, GE) operand, then GOTO
    branchaddress.

FEQL
    If 1st 16 bits of Reg equal 16-bit literal, GOTO
    branchaddress. This yields tests for zero, "uninitialized",
    "infinitesimal" and "unrepresentable/infinity", since these
    are all encoded in the exponent field. No floating-point
    word with zero exponent is allowed except zero itself.
    (Tests for equal in floating point have otherwise been
    eliminated as useless and misleading.)

FLTD
    Double-precision compare. If Reg1 & next is less than
    Reg2 & next, GOTO branchaddress. (Reverse registers
    for .GE.)

FGED
    If Reg1 & next is greater than or equal to Reg2 & next,
    GOTO branchaddress. (Reverse register addresses for .LE.)

SETFL
    Set infinitesimal control bit. Exponent underflow
    thereafter results in "infinitesimal".

SETZ
    Reset infinitesimal control bit. Exponent underflow
    thereafter results in zero.

Integer

IADD(M,L), ISUB(M,L)
    Reg1 plus (minus) operand -> Reg3

IADD1, ISUB1
    Reg1 plus (minus) 1 -> Reg1

IMUL(M,L)
    Reg1 times operand -> Reg3

IDIV(M,L)
    Reg1 divided by operand -> Reg3

IMOD(M,L)
    Reg1 modulo operand -> Reg3

IMOD521
    Reg1 & next modulo 521 -> Reg3. (This is a special, fast
    instruction, as it is needed to determine EM module number
    from EM address. Estimated, 4 clocks.) (Note the absence
    of IDIV521. A DIV512 will be built into the address path
    to CN buffer, taking no time. The 1.8% holes left this way
    in memory can be addressed by a different set of address
    computations; they will be in a logically disjoint address
    space.)

IADDX(M,L), ISUBX(M,L)
    Reg1 & next plus (minus) operand -> Reg3 & next
    (Double-length and single-length operands combined into
    a double-length result)

IMULX(M,L)
    Reg1 & next times operand -> Reg3 & next

IDIVX(M,L)
    Reg1 & next divided by operand -> Reg3 & next

IMODX(M,L)
    Reg1 & next modulo operand -> Reg3

IADDD, ISUBD
    Reg1 & next plus (minus) Reg2 & next -> Reg3 & next

ISH(C,S,N)(L)
    Shift Reg1 end-around (or end-off, or numeric with
    sign-bit fill if right or zero fill if left) by the
    distance shown by the operand (positive is shift left, to
    coincide with the requirements of numeric shifts)

ISH(C,S,N)D(L)
    Shift, as above, except double-length. Reg1 & next.

IOR(M,L), IAND(M,L), IXOR(M,L)
    Reg1 OR (or AND, implies, exclusive OR) operand -> Reg3

INOT(M,L)
    NOT operand -> Reg3

IDL
    Literal (32 bits) -> Reg & next

IADDL
    Reg & next plus literal (32 bits) -> Reg & next

IMOVE(M,L)
    Operand -> Reg2

IDMOVE(M,L)
    Operand -> Reg2 & next (if operand is register, it is a
    double length register)

IPNO
    Processor number (wired into backplane) minus "sparebit".
    Sparebit = 1 if processor above the spare location,
    = 0 if processor below. Leading two bits are cabinet number,
    and are not involved in the subtraction, since each cabinet
    has one spare.

IPAK3M
    Reg1 & Reg2 & Reg3 -> memory

IUPK3M
    Memory -> Reg1 & Reg2 & Reg3

IPAK3F
    Reg1 & Reg2 & Reg3 -> fl. pt. reg. (Because of
    instruction format limitations, not all three int. Reg. will
    be explicitly addressed; one or two of them will be "next"
    int. Reg.)

IUPK3F
    Fl. pt. Reg. -> Reg1 & Reg2 & Reg3

ISTORE
    32 zeroes & Reg -> memory

IDSTORE
    16 zeroes & Reg & next -> memory

ILT(M,L), ILE(M,L), IGT(M,L), IGE(M,L), IEQ(M,L), INE(M,L)
    If Reg1 tests LT (or LE, GT, GE, EQ, NE) operand, then
    GOTO branchaddress

IDLT, IDGT, IDEQ, IDNE
    If Reg1 & next tests LT (or GT, EQ, or NE) to Reg2 &
    next, then GOTO branchaddress. (Reversal of registers
    provides the relations .GE. and .LE.)

IBIT(L)
    If any bit of Reg ANDed with operand is ONE, GOTO
    branchaddress


CN Buffer

FSTOREM
    Wait for CN buffer to become NOT "busy". Send int.
    Regx (EM module number) and int. Reg2 & next
    (EM address) to CN buffer address portion, send fl. pt.
    Regx to CN buffer data portion. Mark CN buffer "busy".
    (Following this instruction, CN buffer will stay "busy"
    until an acknowledge is received from the EM module, and
    the buffer contents transmitted. Buffer will then be NOT
    "busy" and NOT "full". The processor instruction execution
    does not wait for any of this to happen.)

ISTOREM
    Same as FSTOREM except substitute int. Regx for fl. pt.
    Regx.

IDSTOREM
    Same as FSTOREM except substitute int. Regx & next for
    fl. pt. Regx.

I3STOREM
    Same as FSTOREM except substitute int. Regx & int. Reg4
    & int. Reg5 for fl. pt. Regx. Format limitations will
    probably force the use of implicit addresses for Reg4 and
    Reg5. They are likely to be the next two after Regx.

MSTOREM
    Same as FSTOREM except substitute memory for fl. pt. Regx.
    (Note the asymmetry between STOREM and LOADEM. In LOADEM,
    the selection of destination is separated from the EM
    address operation, in order to allow the compiler to optimize
    the sequencing of instructions. In STOREM, the instructions
    are combined in order to save code space.)


EMHEQ
    (Formerly "FETCHEM" and "BDCST", but with the new CN these
    initiating actions are the same for both, i.e., just one
    instruction.) Wait for "I got here" to be reset, if up.
    If CN buffer is "busy", wait for CN buffer to become NOT
    "busy". Raise "I got here". (Later, data will arrive in
    the CN buffer, which will then be marked "FULL", and the
    data can be read by any of the -REM instructions. Depending
    on whether the coordinator executed a FETCHEM or a
    BDCST instruction, that data will have arrived from EM or
    from the coordinator itself.)

EMPIIil.
    (Formerly "HVST" and "SHIPCN", but with the new CN these
    initiating actions are the same for both, just one
    instruction.) If CN buffer "busy", wait for NOT "busy".
    If "I got here" is up, wait for "go" to reset it. Raise
    "I got here", load CN buffer (data part) from fl. pt. Reg,
    and set CN buffer to "busy". (Following this instruction,
    the coordinator will set the CN to a "broadcast" condition
    if HVST, or to a "wraparound" condition if SHIPCN, and
    move the data from the CN buffer. If SHIPCN is in the
    coordinator instruction stream, then the compiler will have
    inserted some form of -REM instruction later on in the
    processor instruction stream to read the now "full" CN
    buffer. Other sources of data are expected to be used so
    seldom that instructions to HVST or swap data to and from
    integer registers and memory are judged to be a waste of
    decoding complexity.)

LOADEM
    Send Regx (EM module no.) and Reg2 & next (EM
    address) to CN buffer, after waiting for CN buffer to
    become NOT "busy". CN buffer will now become "busy"
    until data arrives from EM, whereupon CN buffer becomes
    "full". Fetch next instruction without waiting.

LOCKEM
    Send Regx (EM module no.) and Reg2 & next (EM
    address) to CN buffer, after waiting for CN buffer to
    become NOT "busy". CN buffer now becomes "busy" until
    data arrives from EM, whereupon CN buffer becomes "full".
    Processor does not wait in this instruction beyond the
    loading of the CN buffer. EM module will set the least
    significant bit of the word in memory to ONE after
    transmitting the previous contents to the CN buffer.
    (Used for inter-processor cooperation via EM independently
    of the coordinator.)

FREM
    Wait for CN buffer NOT "busy". If now NOT "full", error
    interrupt. CN buffer -> Fl. pt. Reg. Mark buffer NOT
    "full" and NOT "busy".


IREM
    Same as FREM except CN buffer -> Int. Reg

IDREM
    Same as FREM except CN buffer -> Int. Reg & next

I3REM
    Same as FREM except CN buffer -> Int. Reg & next & next

MREM
    Same as FREM except CN buffer -> memory.

ITIX
    First: Reg1 + Reg3 -> Reg1. Then, test Reg1
    against Reg2; test for greater if Reg3 is positive,
    for less if Reg3 is negative. If test succeeds, GOTO
    branchaddress.

ITIX1
    Same, except an implied literal value of +1 substitutes for
    Reg3

ITIXL
    Same, except an actual literal substitutes for Reg3

IJUMP
    GOTO branchaddress

ICALL
    Subroutine entry. Push subroutine stack. Parameters and
    new working area are relative to the new stack address
    pointer.

IRETURN
    Subroutine return. Pop subroutine stack.

PUSH
    Push subroutine stack, do not change PCR (diagnostics).

POP
    Pop subroutine stack, do not change PCR (diagnostic use)

TOS(L)
    Set stack address pointer and the word pointed to, to new
    values found in Reg1, Reg2 in operand. (Stack
    mechanism involves not only a stack address pointer, but
    also return information and address bound(s) in word 0
    relative to that pointer.)

WAIT
    Wait for reset of "I got here" (if it is up). Raise
    "I got here". Wait for "go" before fetching next
    instruction.

STOP
    Wait for reset of "I got here". Reset "enable". Raise
    "I got here". Resetting of "enable" disables all further
    instruction fetching.

HELP
    Same as STOP plus raise "interrupt" line to coordinator.

IINT(L)
    Interrupt register AND operand -> Reg3. Interrupt register
    AND NOT operand -> interrupt register. Operand is Reg2
    or literal.

ISMASK(L)
    Interrupt mask register OR operand -> interrupt mask
    register

IRMASK(L)
    Interrupt mask register AND operand -> interrupt mask
    register

ICALLI
    Enter interrupt mode

IRETI
    Return from interrupt
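
The IMOD521 remark in the integer group can be made concrete with a
short sketch of the address split it implies. The sketch assumes that
the module number is the EM address modulo 521 and that the address
within the module is the EM address with its last 9 bits stripped
(the DIV512 mentioned under IMOD521); the function names are
illustrative only.

```python
MODULES = 521     # EM modules, from the IMOD521 entry
ROW_WORDS = 512   # 2**9, from the DIV512 remark


def em_location(address: int) -> tuple[int, int]:
    """Split an EM address into (module number, address within module)."""
    module = address % MODULES
    word_in_module = address >> 9   # same as dividing by 512
    return module, word_in_module


def hole_ratio(rows: int = 16) -> float:
    """Ratio of unused to used words over the first `rows` rows."""
    addresses = range(rows * ROW_WORDS)
    used = {em_location(a) for a in addresses}
    # No two addresses land on the same word of the same module.
    assert len(used) == len(addresses)
    holes = rows * MODULES - len(used)
    return holes / len(used)
```

Under these assumptions each row of 521 module words holds 512
addressed words, so 9 words per row are left over, a ratio of 9/512
or roughly 1.8%, in line with the figure quoted in the table.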

Table C.1, part 2

Processor operations induced by commands issued by the
coordinator.

RESETE
    Reset "enable" immediately. Do not wait for current
    instruction to finish. Reset "busy" in CN buffer.
    Reset "I got here".

HALT
    Reset "enable" only. Allow current instruction to complete.

FILLM
    Load word in CN buffer into processor memory. Increment
    memory address by 1. (MAR has previously been loaded)

FILLME
    Same but conditional on "enable".

FILLR
    Load register from CN buffer. Register address will
    follow this code on the command lines.

FILLRE
    Same as FILLR except conditional on "enable".

ADDR
    Address processor. Reset "enable". Check contents of
    CN buffer against processor number. If matches, set
    "enable".

READR
    Transmit contents of register to CN buffer. Register
    address will follow this code on the command lines.

READM
    Read word from memory and transmit to CN buffer. Increment
    memory address by 1.

(Register addresses will include registers not addressable by the
address fields in the processor instruction set, such as PCR and
memory address register.)

TABLE C.1, part 3

Coordinator Instruction Set

Arithmetic

CADD(N,L), CSUB(N,L)
    Reg1 plus (or minus) operand -> Reg3. (Note the use of
    N to designate coordinator memory)

CADD1
    Reg1 plus 1 -> Reg1

CSUB1
    Reg1 minus 1 -> Reg1

CMUL(N,L)
    Reg1 times operand -> Reg3

CDIV(N,L)
    Reg1 DIV operand -> Reg3

CMOD(N,L)
    Reg1 modulo operand -> Reg3

CMOD521
    Reg1 modulo 521 -> Reg3. (Substantially faster than
    CMODL with literal = 521)

CSH(C,S,N)(L)
    Shift end-around (or end-off, or numeric) the operand in
    Reg1 by the distance shown in operand (Reg2 or
    literal)

CAND(N,L), COR(N,L), CIMP(N,L), CXOR(N,L)
    Reg1 AND (OR, implies, exclusive OR) operand -> Reg3

CNOT(N,L)
    NOT operand -> Reg

CMOVE(N,L)
    Operand -> Reg

CDL
    32-bit literal -> Reg

CADL
    Reg1 plus 32-bit literal -> Reg1

CSTORE
    Reg -> memory

CGT(N,L), CGE(N,L), CLT(N,L), CLE(N,L), CEQ(N,L), CNE(N,L)
    Test Reg1 for GT (or GE, LT, LE, EQ, NE) against
    operand; if test is true, GOTO branchaddress.

CBIT(N,L)
    If any bit of Reg AND operand is ONE, GOTO branchaddress.



Other Branch Controls

CTIX
    First: Reg1 + Reg3 -> Reg1. Then, test Reg1
    against Reg2; test for greater if Reg3 is positive,
    for less if Reg3 is negative. If test succeeds, GOTO
    branchaddress.

CTIX1
    Same, except an implied literal value of +1 substitutes
    for Reg3.

CTIXL
    Same, except an actual literal substitutes for Reg3

CJUMP
    GOTO branchaddress

CCALL
    Call subroutine, push subroutine stack.

CRETURN
    Return from subroutine, pop subroutine stack.

CPUSH
    Same push-stack action as CCALL, but do not change program
    counter.

CPOP
    Same pop-stack action as CRETURN, but do not change program
    counter.

CTOS(L)
    Change stack pointer by loading it with operand

CRETI
    Return from interrupt

CCALLI
    Enter interrupt mode.

Other

CLOADEM
    Fetch to Reg1 from EM. EM address is in Reg2, EM
    module no. is in Reg3. (Separation of EM address and EM
    module no. permits accessing of both address spaces within
    the EM. Note that the "EM address" will be stripped of
    its last 9 bits before being transmitted to the EM as an
    "address within module".)

CLOCKEM
    Same as CLOADEM except that the EM module will set the
    least significant bit of the word in memory to ONE after
    fetching the word sent to the coordinator.

CLOADEMN(L)
    Fetch N words from EM; store to coordinator memory.
    Memory address is in instruction, EM address is in Reg2,
    EM module no. is in Reg3. N is Reg1 or literal.

CSTOREM
    Store Reg1 to EM. EM address is in Reg2, module no.
    in Reg3.

CSTOREMN(L)
    Store N words from memory to EM. EM address is in Reg2,
    module no. in Reg3. N is Reg1 or literal (actually,
    countfield)

CINT(L)
    Interrupt register AND operand -> Reg3. Interrupt
    register AND NOT operand -> interrupt register. Operand is
    Reg2 or literal

CSMASK(L)
    Interrupt mask register OR operand -> interrupt mask
    register.

CRMASK(L)
    Interrupt mask register AND operand -> interrupt mask
    register. Any interrupt bit so unmasked causes interrupt
    when ONE.

(Note: The instructions in the coordinator up to this
point represent functionally a subset of the processor
capabilities. One possible implementation of them would
be to use a copy of the processor as most of the
coordinator. We believe that the coordinator needs 32-bit integer,
and needs more integer registers, too often for this to be
a good idea.)

(The following instructions represent coordinator
capabilities which are not needed in the processor. Indeed,
one of the reasons for having a separate coordinator is
so that these functions need not be replicated 512 times,
once per processor, nor do the processors require the
connectivity to the points (DBM controller, host, etc.)
that these functions imply.)

Processor  Cooperation 

FETCHEM  From EM address in Reg2, and EM module no. in Reg3, cause
the given EM module to cycle, and the result broadcast to
the CN buffer of all processors. Start of instruction will
wait on "All processors ready" and "go" will be issued at
an appropriately delayed time.

SHIFTCN(L)  Wait for "All processors ready". Send "wraparound" command
to CN level N, where N is found in Reg or literal. Send
"go".

LOOP  Wait for "all processors ready". If NOT "any processor
enabled", set the "enable" bit of all processors, and exit
the instruction. If "any processor enabled", issue "go"
and GOTO branchaddress.
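The LOOP description can be read as a small decision procedure. The sketch below is a hypothetical model, not coordinator microcode: the function name, the list-of-booleans representation of the "enable" bits, and the use of `pc + 1` for "exit the instruction" are all assumptions, and the wait for "all processors ready" is taken as already satisfied.

```python
def loop_instruction(enable_bits, pc, branch_address):
    """Model one execution of LOOP once "all processors ready" holds.

    Returns (next_pc, new_enable_bits, go_issued).
    """
    if not any(enable_bits):
        # No processor enabled: set every "enable" bit and fall through.
        return pc + 1, [True] * len(enable_bits), False
    # At least one processor enabled: issue "go" and branch.
    return branch_address, list(enable_bits), True
```

For example, `loop_instruction([False, False], 10, 4)` falls through with every processor re-enabled, while `loop_instruction([False, True], 10, 4)` branches back to address 4.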


SYNC  Wait for "all processors ready", issue "go".

BDCST  Wait for "all processors ready". Set CN to "broadcast"
mode, last 48 bits of Reg & next sent to CN buffers of all
processors. Issue "go".

BDCSTN  Wait for "all processors ready". Set CN to "broadcast"
mode, send word fetched from coordinator memory thru CN to
all CN buffers. Memory address has normal address format.

HVST  Wait for "all processors ready". Set CN to "harvest"
mode, contents of all CN buffers that are "full" are
combined (ORred is acceptable; the actual formula for
combining is logic designer's option) and transmitted to
the last 48 bits of Reg1 & next.

PINT  If "any processor in interrupt mode", GOTO branchaddress.

Actions  imposed  on  Processors 

UBDCST  Send N words to processor. N in Reg1. Words taken from
successive addresses in coordinator memory starting at
address given in instruction. (Processor will have
previously been put into a waiting or NOT "enable"d state,
and its MAR loaded with the starting address in PM.)

UBDCSTE  Same, except acceptance of data is conditional on "enable"
bit of processor.

USETP  Send contents of Reg1 to processor register whose
address is in Reg2. (Used for initializing PCR, setting
MAR, as well as for transmitting ordinary data.)

USETPE  Same except conditional on "enable" bit of processor.

USETPO  Same as USETP except that "enable" bit of all processors is
turned on at end of instruction.

USETPEO  Same as USETPO except that acceptance of data is conditional
on previous state of enable bit.

HALTP  Reset "enable" of every processor.

STOPP  Reset "enable", "I got here", and "busy" of every processor.
Processors will cease executing in mid-instruction.

TESTR  If "all processors ready", GOTO branchaddress.

READP  Word from processor register addressed in Reg1 is brought
back to Reg2 and the register following Reg2.




READPM  Word from processor memory (at address set by MAR of
processor) is brought back to Reg2 and the register next
after Reg2.

READPMN  N words from processor memory, starting at the address
found in this instruction. This and previous two instructions
are conditional on the "enable" bit.

PROC  Wait for "All processors ready". Send ADDR command with
contents of Reg as the processor number.

TESTE  If "any processor enabled", GOTO branchaddress.

SPARE  Change the designation of spare processor, or of spare EM
module. There are four registers designating spare processor,
and four registers designating spare EM module.
These registers are readable with the CMOVE instruction.

SETCN(L)  Set CN controls to bit pattern found in register (literal).
This command modifies CN function for diagnostic
purposes, such as restricting access to one or the other
sheet of a two-sheet CN.

m 

TIOM  Reg1 & next transmitted to DBM controller as control word.

STATUS  Status word of DBM controller fetched into Reg1 & next.
(Status will be same as control word, except word count
will be decremented to current state, and a field of status
bits may have been changed by the DBM controller. Format
TBD.)

TIOH  Reg1 & next transmitted to host-readable register,
interrupt host.

HOST  Read host read and writable register into Reg1 & next.

SCLOCK  Transfer Reg to real-time clock. Clock decrements at a
fixed rate, TBD, causing interrupt when it decrements past
zero. Setting the clock resets the interrupt bit, if on.

RCLOCK  Transfer contents of real-time clock counter to Reg.
TABLE C.2

PROCESSOR INSTRUCTIONS

Column key: Word = half (½) or full (1) word; Clocks = Proc. Clock
Count; Busy = busy intervals in the order printed across the
Int. Unit Busy, F.P. Unit Busy, Mem. Busy, and Min. CF Buf. Busy
columns (blank columns are not shown).

Mnemonics                             Word  Clocks    Busy

FADD, FSUB                            ½     6         0-6
FADDM, FSUBM                          1     9         0-1, 3-9, 0-3
FADDL, FSUBL                          1     6         0-6
FMUL                                  ½     9         0-9
FMULM                                 1     10        0-1, 3-12, 0-3
FMULL                                 1     9         0-9
FDIV, FDVR                            ½     44        0-44
FDIVM, FDVRM                          1     47        0-1, 3-47, 0-3
FDIVL, FDVRL                          1     44        0-44
FMAD, FSUB                            ½     11        0-11
FMADM, FSUBM                          1     14        0-1, 3-14, 0-3
FMADL, FSUBL                          1     11        0-11
FSSQ                                  ½     21        0-21
FSSQM                                 1     24        0-1, 3-24, 0-3
FADSC                                       2         0-2, 0-2
FADEXL                                ½     2         0-2
FMOVEX                                ½     2         0-2, 0-2
FMOVEXL                               ½     2         0-2
FABS, FNEG                            ½     1         0-1
FABSM, FNEGM                          1     4         3-4, 0-3
FADDX, FSUBX                          ½     7         0-7
FADDXM, FSUBXM                        1     10        0-1, 0-10, 0-3
FADDXL, FSUBXL                        1     7         0-7
FMULX                                 ½     15        0-15
FADDD, FSUBD                          ½     7         0-7
FMULD                                 ½     22        0-22
FL (48-bit, only 1½ word format)      1½    4         1-4
FMOVE                                 ½     1         0-1
FMOVEM                                1     4         0-1, 3-4, 0-3
FMOVEL                                1     1         0-1
FPAKM                                 1     9         6-7, 0-6, 6-9
FPUPKM                                1     5         0-1, 2-5, 0-3
FSTORE                                1     3         0-1, 0-3, 0-3
FIX, FIXP, FIXC                       ½     4         3-4, 0-3
FIXD                                  ½     5         3-5, 0-3
FINFLZ                                ½     1         0-1
FIXEX                                 ½     3         0-3, 0-1
FMT                                   ½     7         0-7
FMTI                                  ½     5         0-5
FLOAT                                 ½     3         0-1, 0-3
FLT, FLE, FGT, FGE (branch)           ½     2         1-2, 0-2
FLTM, FLEM, FGTM, FGEM (branch)       1     5         4-5, 3-5, 0-3
FLTL, FLEL, FGTL, FGEL (branch)       1     2         1-2, 0-2
FEQL (branch)                         1     3         2-2, 0-3
FLTD, FGTD (branch)                   ½     3         2-3, 0-3
SETFL, SET2                           ½     1         0-1
IADD, ISUB, IADD1, ISUB1              ½     1         0-1
IADDM, ISUBM                          1     4         0-4, 0-3
IADDL, ISUBL                          1     1         0-1
IMUL                                  ½     9max      0-9
IMULM                                 1     12max     0-12, 0-3
IMULL                                 1     9max      0-9
IDIV, IMOD                            ½     16max     0-16
IDIVM, IMODM                          1     19max     0-19, 0-3
IDIVL, IMODL                          1     16max     0-16
IMOD521                               ½     4         0-4
IADDX, ISUBX                          ½     2         0-2
IADDXM, ISUBXM                        1     5         0-5, 0-3
IADDXL, ISUBXL                        1     2         0-2
IMULX                                 ½     17max     0-17
IMULXM                                1     20max     0-20, 0-3
IMULXL                                1     17max     0-17
IDIVX, IMODX                          ½     32max     0-32
IDIVXM, IMODXM                        1     35max     0-35
IDIVXL, IMODXL                        1     32max     0-32
IADDD, ISUBD                          ½     2         0-2
ISH(C,S,N)(L)                         ½     2         0-2
ISH(C,S,N)D(L)                        ½     5         0-5
IOR, IAND, IXOR, IIMP                 ½     1         0-1
IORM, IANDM, IXORM, IIMPM             1     4         0-4, 0-3
IORL, IANDL, IXORL, IIMPL             1     1         0-1
INOT, IMOVE, ITOS                     ½     1         0-1
INOTM, IMOVEM                         1     4         0-4, 0-3
INOTL, IMOVEL, ITOSL                  1     1         0-1
IDL, IADL                             1     2         0-2
IDMOVE                                ½     2         0-2
ISTOREM, IDSTOREM                     ½     3         0-3, 0-9
I3STOREM                              ½     3         0-3, 0-6
MSTOREM                               1     4         0-4, 0-3, 1-8
EMREQ, EMFILL                         ½     1         0-1
LOADEM, LOCKEM                        ½     3         0-3, 0-12
FREM                                  ½     2         1-2, 0-2
IREM                                  ½     2         1-2, 0-2
IDREM                                 ½     3         1-3, 0-3
I3REM                                 ½     4         1-4, 0-4
MREM                                  1     4         1-2, 1-4, 0-4
ITIX, ITIXL, ITIXL (branch)           1     3         0-3
IJUMP (branch)                              2         0-2
ICALL, IRETURN, PUSH, POP, IRETI            30 (TBD)
WAIT, STOP, HELP                      ½     4         0-4
ICALLI                                ½     1
IIOT                                  ½     2         0-2
IIOTL                                       2         0-2
ISMASK, IRMASK                        ½     1         0-1
ISMASKL, IRMASKL                      1     1         0-1
COORDINATOR INSTRUCTIONS

Column key: Word = half (½) or full (1) word; Clocks = Coord Clock
Count; Busy = busy intervals in the order printed across the
Arith Unit Busy, Mem. Busy, and Min. CN Buf. Busy columns
(blank columns are not shown).

Mnemonics                             Word  Clocks    Busy

CADD, CSUB, CADD1, CSUB1,             ½     1         0-1
  CSH(C,S,N)(L), CAND, COR,
  CIMP, CXOR, CNOT, CMOVE,
  CTOS
CADDN, CSUBN, CANDN, CORN, CIMPN,     1     4         1-4, 0-3
  CXORN, CNOTN, CMOVEN
CADDL, CSUBL, CADL, CDL, CANDL,       1     1         0-1
  CORL, CIMPL, CXORL, CNOTL,
  CMOVEL, CTOSL
CMUL                                  ½     16max     0-16
CMULM                                 1     19max     3-19, 0-3
CMULL                                 1     16max     0-16
CDIV, CMOD                            ½     32max     0-32
CDIVM, CMODM                          1     35max     0-35, 0-3
CDIVL, CMODL                          1     32max     0-32
CMOD521                               ½     4         0-4
CSTORE                                1     3         0-3, 0-3
CGT, CGE, CLT, CLE, CBIT (branch)     1     3         0-3
CGTN, CGEN, CLTN, CLEN, CBITN         1     6         0-6, 0-3
  (branch)
CGTL, CGEL, CLTL, CLEL, CBITL         1     3         0-3
  (branch)
CEQ, CNE (branch)                     ½     4         0-4
CEQN, CNEN (branch)                   1     7         0-7, 0-3
CEQL, CNEL (branch)                   1     4         0-4
CTIX, CTIXL, CTIX± (branch)           1     3         0-3
CJUMP (branch)                              2         1-2
CCALL, CRETURN, CPUSH, CPOP,                30 (TBD)
  CRETI
CLOADEM, CLOCKEM                      ½     13        0-13, 0-12
CLOADEMN(L)                           1     9+12N     0-(7+12N), 13-(9+12N), 0-(9+12N)
CSTOREM                               ½     3         0-3, 0-7
CSTOREMN(L)                           1     9N-6      0-(9N-6), 0-9N
CINT                                  ½     2         0-2
CINTL                                 1     2         0-2
CSMASK, CRMASK                        ½     1         0-1
CSMASKL, CRMASKL                      1     1         0-1
FETCHEM                                     13        0-3, 0-12
SHIFTCN, SHIFTCNL                     ½     9         0-3, 0-9
LOOP, PINT, TESTR, TESTE (branch)     ½     2         0-2
SYNC                                  ½     2         0-2
BDCST                                 ½     9         0-5
                                      ½     9         0-9
BDCSTN                                1     12        3-8
UBDCST, UBDCSTE                       1     7+6N      0-(7+6N)
USETP, USETPE, PROC                   ½     9         0-4
USETPO, USETPEO                             11        0-4
HALTP, STOPP                          ½     1
READP, READPM                         ½     15        0-15
READPMN                               1     15+6N     0-(15+6N)
TIOM, TIOH, STATUS, HOST              ½     2         0-2
SCLOCK, RCLOCK                              3 (TBD)
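Several coordinator entries give the clock count as a formula in N, the number of words moved. The sketch below evaluates those formulas as they can be read from the table (9+12N, 9N-6, 7+6N, and 15+6N). The function names are hypothetical, and which multi-word instruction each formula belongs to is taken from a partly illegible table, so treat the pairing as an assumption.

```python
def cloademn_clocks(n):
    # Fetch N words from EM: 9 + 12N clocks (as read from the table).
    return 9 + 12 * n


def cstoremn_clocks(n):
    # Store N words to EM: 9N - 6 clocks (as read from the table).
    return 9 * n - 6


def ubdcst_clocks(n):
    # Send N words to a processor: 7 + 6N clocks (as read from the table).
    return 7 + 6 * n


def readpmn_clocks(n):
    # Read N words from processor memory: 15 + 6N clocks (as read from the table).
    return 15 + 6 * n
```

For a four-word transfer these give 57, 30, 31, and 39 clocks respectively.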
APPENDIX D


RELIABILITY,  AVAILABILITY,  AND  MAINTAINABILITY  PROGRAMS 

From a system engineering viewpoint, the design of a reliable and
maintainable digital computer system encompasses many
interdisciplinary technical trade-off decisions. This appendix describes
the computer programs called DESIGN and CONFIGURE which have been
developed by the Burroughs Corporation to focus attention on
critical Reliability, Availability, and Maintainability (RAM)
design factors that have been repeatedly observed to dominate the
frequency of abnormal system interruption and the duration of
downtime in fault-tolerant computer systems. In analyzing the RAM
characteristics of the Flow Model Processor (FMP), the DESIGN
Program was used to pinpoint critical factors pertinent to the
failure, repair, and recovery processes of the FMP that require
concentrated design attention as the design progresses. The
CONFIGURE program was used to predict the performance of the
Support Processor and File Management Subsystems.

The following paragraphs describe the DESIGN and CONFIGURE
programs in terms of the computer system models applied to the FMP
and the NASF and the computations performed. Salient theoretical
and practical assumptions associated with the mathematical model
utilized and definitions of all input parameters and computed
results are discussed to aid in understanding the analysis
performed and interpreting computed results. Definitions of terms
used in this appendix are presented in Section D.4.

D.1  COMPUTER SYSTEM MODEL

Traditionally, in mathematical analyses of repairable redundant
systems, it has been common to assume that system failure occurs
due to the depletion of hardware resources when an active hardware
element fails before the previously active redundant hardware
element(s) is repaired. Although this conventional "failure and
repair cycle" type model has been applied successfully to
investigate the hardware availability aspects of certain types of
redundant systems, it has been of little practical value in predicting
the operational RAM characteristics of fault-tolerant computer
systems in which hardware elements operate under software control.

In an operational environment, the failure of a computer system to
operate continuously frequently occurs for reasons other than the
depletion of hardware resources due to permanent type failures
which require repair actions. Common causes of computer system
interruption and downtime include intermittent failures and the
inability to automatically recover from certain single critical
hardware failures. Since an accurate reliability estimate must
take into account all applicable sources of system interruption
and downtime, to the extent possible, Programs DESIGN and
CONFIGURE have been developed to treat overall computer system
behavior in terms of hardware subsystems operating under software
control as depicted in the availability block diagram of Figure
D.1. From a reliability point of view, each of the critical
subsystems in Figure D.1 must operate successfully in order to
sustain proper system operation.

As shown in Figure D.1, any number of independent hardware
subsystems operating under software control can be defined to take
into account as many functions as required. The subsystem model
is based on the premise that if a redundant hardware element
fails, the particular subsystem involved may be interrupted for a
short time to effect reconfiguration. After a short delay, the
subsystem is restored to operation, and continues to operate while
the failed hardware element is being repaired. However, if more
than the specified allowable number of hardware elements are down
for repair or if a critical hardware element has failed, then the
subsystem is down until the appropriate repair has been effected.
As shown in Figure D.1, the failure of any subsystem breaks the
critical success path, causing the system to fail.


[Figure D.1: block diagram showing SUBSYSTEM 1 (R1/N1) through
SUBSYSTEM M (RM/NM), which together form the SYSTEM (HARDWARE
OPERATING UNDER SOFTWARE CONTROL)]

Figure D.1. Computer System Availability
Block Diagram
Computation of Mean Up Time (MUT), Mean Down Time (MDT), and
Availability for each of the specified critical elements and
subsystems is performed using the mathematical model discussed in
the following paragraph. System MUT, MDT, and Availability are
then computed based on the successful operation of all subsystems
using conventional methods. The assumption associated with this
system decomposition technique requires only that the system be
composed of independent subsystems, each of which can be regarded
as having two possible outputs (working and failed) and that it is
possible to identify a certain set of subsystem states as "working
states" and the remaining states as "failed states".
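As an illustration of how system figures can be rolled up from subsystem figures, the sketch below uses one common textbook treatment of independent subsystems that must all be working: availabilities multiply and interruption rates add. The appendix does not spell out its "conventional methods", so these formulas and all names in the code are assumptions rather than the DESIGN program's actual computation.

```python
from math import prod


def availability(mut, mdt):
    # Long-run fraction of time a unit is up.
    return mut / (mut + mdt)


def series_system(subsystems):
    """Combine (MUT, MDT) pairs for independent subsystems in series.

    Returns (system MUT, system MDT, system availability).
    """
    a_sys = prod(availability(mut, mdt) for mut, mdt in subsystems)
    # Approximation: the system is interrupted whenever any subsystem is.
    interruption_rate = sum(1.0 / mut for mut, _ in subsystems)
    mut_sys = 1.0 / interruption_rate
    # Choose MDT so that MUT / (MUT + MDT) reproduces the availability.
    mdt_sys = mut_sys * (1.0 - a_sys) / a_sys
    return mut_sys, mdt_sys, a_sys
```

With two subsystems of (MUT, MDT) = (100, 1) and (200, 2) hours, the system MUT comes out near 66.7 hours, shorter than either subsystem's own MUT.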

D.2  MATHEMATICAL  MODEL  FOR  THE  DESIGN  PROGRAM 

The mathematical model employed in the DESIGN Program is a
discrete-state continuous-time model called a Markov process. As
with any type of Markov model, the underlying assumption of this
process is that the transition probability Pij from any state i to
any state j depends only on the states i and j and is
completely independent of all past states except the last one.

The  transition  probabilities  must  obey  the  following  two  rules: 

- The probability of a transition in time Δt from one state
to another is given by Z(t)Δt, where Z(t) is the hazard
associated with the two states in question. If all Z(t)'s
are constant, as assumed herein, the model is called
homogeneous.

- The probabilities of more than one transition in time Δt
are infinitesimals of higher order and can be neglected.

These properties and assumptions are quite widely accepted as
being appropriate to modeling the failure and repair cycles of
computer systems.
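The two rules can be made concrete with a first-order transition matrix for a homogeneous model. In the sketch below the probability of moving from state i to state j in a short interval is taken as the constant hazard times the interval, and whatever remains is the probability of staying put. The function name and the nested-list layout are illustrative assumptions.

```python
def step_probabilities(rates, dt):
    """First-order transition probabilities over a short interval dt.

    rates[i][j] is the constant hazard from state i to state j (i != j);
    diagonal entries of rates are ignored.
    """
    n = len(rates)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        leaving = 0.0
        for j in range(n):
            if i != j:
                p[i][j] = rates[i][j] * dt  # Z * dt for a single transition
                leaving += p[i][j]
        # Multiple transitions within dt are neglected (higher order).
        p[i][i] = 1.0 - leaving
    return p
```

The approximation is only meaningful when dt is small enough that every row's off-diagonal entries sum to well under one.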

D.2.1  Markov Graphs

Figure D.2 is a Markov graph depicting the transitions between
states for each of the subsystems defined in the Computer System
Availability Block Diagram of Figure D.1. In Figure D.2, shaded
states represent subsystem failure, and consequently system
failure since all subsystems are required to be functioning properly
to achieve system success.

For hardware subsystems operating under software control, the
Markov Graph is quite complex. Therefore, the simplified Markov
Graph shown in Figure D.3 will be described first as an
introduction to considering the chain pertaining to the depletion of
redundancy shown in Figure D.2.
D.2.2  Conventional Failure and Repair Cycle Model

The Markov Graph shown in Figure D.3 is typical of conventional
failure and repair cycle models for repairable redundant systems
where the mechanisms for removing failed hardware elements from the
system and replacing repaired hardware elements are tacitly
assumed to be perfect. As shown in Figure D.3 the number of states
in the Markov Graph is a variable since each subsystem may contain
0, 1, 2, 3 or more active redundant hardware devices. If, for
instance, a subsystem contains two active identical devices (N),
only one of which is required to be operating for subsystem
success (R), State L+1 becomes State 3 (a DOWN state) which
terminates the chain since L+1=N-R+2, or L=N-R+1.

For the subsystem with one redundant device, it is common to
hypothesize that at least two device failures must occur before a
subsystem failure can occur. Normally, the Mean Time to Repair
(MTTR) of a device is very short compared with the Mean Time
Between Failures (MTBF); therefore, many allowable device failures
are expected to occur before a subsystem failure occurs due to a
second device failure during the time when a failed device is
being repaired. Thus, on the surface it appears that tremendous
gains are in store if sufficient redundancy is provided for
critical devices in each subsystem since a second
failure during a repair cycle is a rare event.
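To see how rare a second failure during a repair is, one can treat the repair and the next failure as competing exponential events, in line with the constant-hazard assumption of Section D.2. The sketch below is illustrative only; the function name and the numerical values are assumptions, not figures from the FMP analysis.

```python
def second_failure_probability(n, mtbf, mttr):
    """Chance that another device fails before the current repair ends.

    n    -- number of identical devices in the subsystem
    mtbf -- mean time between failures of one device
    mttr -- mean time to repair one device (same time unit as mtbf)
    """
    failure_rate = 1.0 / mtbf   # per-device failure rate
    repair_rate = 1.0 / mttr    # repair rate with one repairman
    remaining = (n - 1) * failure_rate  # devices still operating
    # For competing exponentials the winner is chosen in proportion to rate.
    return remaining / (remaining + repair_rate)
```

With two devices, an MTBF of 1000 hours, and an MTTR of 1 hour, roughly one repair in a thousand is overtaken by a second failure.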
[Figure D.3: Markov graph with states ordered by the number of
FAILED DEVICES]

R: NUMBER OF DEVICES REQUIRED TO BE OPERATING FOR SUCCESS
λ: DEVICE FAILURE RATE
μ: DEVICE REPAIR RATE

Figure D.3. Simplified Markov Graph for Depletion of Redundancy
As shown in Figure D.3, "N" subsystem devices are operating
successfully in State 1. Therefore, the rate at which device
failures occur is "N" times the failure rate (λ) of a single device,
since all devices are required to be identical. In State 2 one
device is failed and either of the following two transitions may
occur:

- The failed device may be repaired, at rate μ, before a
second failure occurs and placed back into service,
returning the subsystem to State 1.

- A second failure may occur before the repair is complete,
further degrading the subsystem to State 3.


In State 2, failures occur at rate (N-1) times λ since one device
is already being repaired. The rate at which failures occur in
subsequent states diminishes as shown in Figure D.3 until the
subsystem contains an inadequate number of hardware devices to
sustain acceptable functional operation. Once the subsystem is in
a failed state, it is assumed that operations cease and no
additional failures occur.

Since program DESIGN is intended to investigate system design
potential, an ideal support environment is assumed in which
replacement spares, trained repairmen, documentation, test
equipment, etc., are all immediately available when required. As shown
in Figure D.3 the rate at which repairs are enacted when one
device is failed is μ, and the rate at which repairs are enacted
when more than one device is failed is 2μ. The underlying
assumptions for the coefficients of μ are that only one repairman
will be assigned to a failed device and that the maximum number of
repairmen available for assignment to a failed subsystem is two.
Thus, if one hardware device fails, one repairman goes to work;
only when two or more devices require repair are both available
repairmen busy. Normally, the probabilities associated with
degradation to states where more than one or two devices require
service simultaneously are very small.
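The depletion-of-redundancy chain is a birth-death process, so its long-run state probabilities follow from balancing flow between neighboring states. The sketch below assumes what the text describes: failures occur at (N-k)λ with k devices already failed, repairs proceed at μ with one device failed and 2μ with two or more, and no further failures occur once the DOWN state is reached. The function names are hypothetical, and treating the DOWN state as repaired at the same two-repairman rate is an assumption.

```python
def depletion_steady_state(n, r, lam, mu):
    """Long-run probabilities of having 0, 1, ..., N-R+1 devices failed.

    n devices, r required for success, per-device failure rate lam,
    single-repairman repair rate mu. The last entry is the DOWN state.
    """
    down = n - r + 1
    weights = [1.0]
    for k in range(down):
        fail = (n - k) * lam            # failure rate with k devices failed
        repair = min(k + 1, 2) * mu     # at most two repairmen at work
        weights.append(weights[-1] * fail / repair)
    total = sum(weights)
    return [w / total for w in weights]


def depletion_availability(n, r, lam, mu):
    # The subsystem is up in every state except the last one.
    return 1.0 - depletion_steady_state(n, r, lam, mu)[-1]
```

For N=2, R=1, λ=0.001 and μ=1 per hour, the DOWN state carries about one part in a million of the time, which is the "tremendous gain" that the next section qualifies.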

D.2.3  Design  Model  for  Hardware  Elements  Operating  Under  Software 
Control 


There are several critical factors involved in adding redundant
devices in computer subsystems which tend to severely reduce the
potential benefits of hardware redundancy. First, the mechanism
for automatically detecting, isolating, and switching failed
devices out of the system and adding repaired devices back into
the system is a complex interdisciplinary design problem. Also,
clocks, controllers, busses and interface circuitry between
hardware devices tend to contain Single Point Failure Modes (SPFM's)
which cause subsystem failures even though the subsystem is not
depleted of sufficient hardware resources. Unlike some types of
systems, both permanent type failures which require a repair
action and intermittent type failures which disappear before being
isolated must carefully be considered in designing computer
systems since either can cause abnormal subsystem interruption.
For continuous operation, the problem of performing scheduled
maintenance actions becomes an important consideration, and
safeguards are necessary to prevent accidental system interruption
when unscheduled maintenance actions are being performed on failed
devices which cannot be physically disconnected from the system.

Referring to the Markov Graph for hardware operating under
software control in Figure D.2, it can be seen that the center portion
labeled "Depletion of Redundancy" corresponds closely to the
previously discussed simplified Markov Graph. As before, the
number of states is a variable depending upon how many redundant
devices are provided in the subsystem. Failure states for
permanent and intermittent type failures related to the recovery process
and SPFM's are organized in line with the labels on the right-hand
side of Figure D.2. State 5L+1 in Figure D.2 provides for
considering scheduled maintenance actions in systems where
continuous operation is desired, and consideration of maintenance
errors during unscheduled maintenance actions is factored into
states where repair actions are being performed while the system
is still operating.

Considering first only the depletion of redundancy, the Markov
Graph for hardware subsystems operating under software control is
based on the premise that if a redundant hardware device fails,
the particular subsystem involved may be interrupted for a
negligible time to effect automatic reconfiguration. After
automatically decommitting the failed device, the subsystem is immediately
restored to operation, and the failed hardware device is then
repaired. When repair of the failed hardware device is completed,
it is recommitted to the subsystem without any discernible
interruption in subsystem service. However, if more than the specified
allowable number of hardware devices are down for repair, or if a
critical hardware device has failed, then the subsystem is down
until the appropriate repair has been effected.

As previously discussed, the five primary sources of system
interruption and downtime diagrammed in Figure D.1 for hardware
operating under software control are:

- Depletion  of  Adequate  Resources 

- Unsuccessful  Recovery  (Intermittent) 

- Unsuccessful  Recovery  (Permanent) 

- Single  Point  Failure  Modes  (Intermittent) 

- Single  Point  Failure  Modes  (Permanent) 
Starting in State 1, "N" identical, independent subsystem devices
are operating successfully. Therefore, the rate at which failures
occur (either permanent or intermittent) is N times the
failure rate of a single device. If no redundancy is provided,
any failure causes a subsystem failure. With redundancy, when a
failure occurs in State 1, any of the following transitions may
occur.


- If the failure is a permanent type failure related to an
SPFM, no recovery is possible and there is a transition to
State 2L+1, which requires a repair action to reestablish
subsystem operation. SPFM repairs are enacted at the SPFM
repair rate. When the repair is completed in State
2L+1, there is a transition back to State 1, where the
subsystem is again operating successfully with all hardware
devices present.

- If the failure is an intermittent type failure related to an SPFM,
no recovery is possible and there is a transition to State
4L+1. Since intermittent failures do not require a repair
action, the device is returned to the subsystem and there
is a transition from State 4L+1 back to State 1 at
the device manual recovery rate, which includes the
time required for the intermittent failure to disappear.

- If the failure is a permanent type failure and the
automatic recovery system is successful, there is a transition
to State 2, in which case the subsystem continues to
operate with one device decommitted from the subsystem. If
no additional events occur before the failed device is
repaired and recommitted to the subsystem, there is a
transition back to State 1. These transitions occur at rate μp,
which is the device repair rate for permanent type
failures. Additional events in State 2 will be discussed
subsequently.

- If the failure is a permanent type failure and the
automatic recovery system is unsuccessful, there is a
transition to State L+2. In State L+2, manual recovery
procedures are enacted at the manual recovery rate and there is
a transition to State 2 where the subsystem is operating with
one device decommitted from the subsystem. As indicated above,
State 2 will be discussed subsequently.

- If the failure is an intermittent type failure and the
automatic recovery system is unsuccessful, there is a
transition to State 3L+1. Again, since intermittent type
failures do not require a repair action, the device is
returned to the subsystem and there is a transition back to
State 1 at the corresponding recovery rate.


- For systems which are required to operate continuously,
State 1 provides the best opportunity for performing any
required scheduled maintenance on subsystem hardware devices.
Therefore, scheduled maintenance is restricted to being
performed only when all subsystem hardware devices are
operating successfully. When a hardware device is decommitted
for scheduled maintenance, there is a transition to State
5L+1 where the subsystem is operating successfully, but
depleted of one of its hardware resources. If the scheduled
maintenance is completed and the device is returned to the
subsystem before an event occurs, there is a transition back to
State 1. However, if an event occurs before the scheduled
maintenance action is completed any of the following may
occur:

- To State 2L+2 if the event is related to a permanent
type SPFM (Subsystem DOWN)

- To State L+3 if the event is related to a permanent type
failure and automatic recovery is unsuccessful
(Subsystem DOWN)

- To State 3 if the event is related to a permanent type
failure and automatic recovery is successful (Subsystem
UP) provided the subsystem contains two or more
redundant devices; otherwise, State 3 is a DOWN state.
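The destinations reached from State 1 can be collected into a small lookup keyed by the kind of event. The sketch below only restates the state labels given in the list above in terms of L; the event names are hypothetical labels introduced for illustration, and cases the text does not describe (such as an intermittent failure with a successful automatic recovery) are deliberately left out.

```python
def state1_destination(event, L):
    """Return the state entered from State 1 for a given event.

    L is the chain-length parameter used in the state labels
    (L+2, 2L+1, 3L+1, 4L+1, 5L+1). Event names are illustrative.
    """
    destinations = {
        "permanent_spfm": 2 * L + 1,            # repair needed, subsystem down
        "intermittent_spfm": 4 * L + 1,         # manual recovery, no repair
        "permanent_recovered": 2,               # runs on with one device out
        "permanent_unrecovered": L + 2,         # manual recovery, then State 2
        "intermittent_unrecovered": 3 * L + 1,  # manual recovery, no repair
        "scheduled_maintenance": 5 * L + 1,     # up, one device decommitted
    }
    return destinations[event]
```

For a subsystem with L=2 the six destinations are States 5, 9, 2, 4, 7, and 11, so each source of interruption lands in its own region of the graph.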


In State 2, subsystems with one or more redundant devices are
operating successfully with one failed device decommitted from the
subsystem. Therefore, the rate at which events related to the
number of hardware devices occur diminishes to a multiplier of
N-1. The subsystem operates essentially as described for State 1
except that the failed device being repaired may be a hazard to
subsystem operation. If the failed device is not disconnected
from the subsystem and safeguards are inadequate, a maintenance
error could occur which brings the subsystem down. The transition
from State 2 to State L+2 accounts for this
potential mode. As shown, catastrophic maintenance errors
occur at a separately designated rate.

D.3  MATHEMATICAL  MODEL  FOR  THE  CONFIGURE  PROGRAM 

In contrast to the traditional two-state failure and repair cycle
reliability model, the CONFIGURE program employs the three-state
model shown in Figure D.4, which enables the effects of manual
recovery from non-permanent failures and errors to be taken into
consideration. This separation of repair and nonrepair events is
the key to modeling the effects of intermittent failures, software
errors, maintenance errors, unisolated events, and unsuccessful
automatic recoveries in close approximation to the physical system
being analyzed.


As shown in Figure D.4, when an element fails, a transition occurs
from the UP state into either the REPAIR state or the INTERRUPT
state. The rate at which these transitions occur is the reciprocal
of the element Mean Up Time (1/MUT). Variable P1 defines the
fraction of failures or errors which cause a transition directly
into the INTERRUPT state. Hence, (1-P1) is the fraction of
failures that cause a transition directly into the REPAIR state.

Once an element is in the REPAIR state, the only possible
transition is to the UP state. The rate at which this transition
occurs is, of course, the reciprocal of the element Mean Repair Time
(1/MRT). When an element is in the INTERRUPT state, transitions
to either the REPAIR state or the UP state can occur. Variable P2
is the fraction of total interrupt events which go into the REPAIR
state rather than going directly into the UP state, and the
reciprocal of the Mean Interrupt Time (1/MINT) is the rate at which
these transitions occur.

For hardware subsystems, the subsystem is considered to be
operating successfully if every element is in the UP state, and the
subsystem is considered to be down if any element causes a
transition to the INTERRUPT state. The subsystem can be operating
successfully with some of the hardware elements in the REPAIR
state. This depends on the definition of how many hardware
elements of the subsystem can be in the REPAIR state with the
subsystem still capable of performing its intended function. For
critical subsystems, the subsystem is considered to be operating
successfully only when no repair action or interruption is in
process. Thus, for unisolated events, operator errors, and
maintenance errors which require no repair action, transitions from
the UP state go directly into the INTERRUPT state. In this case,
restoration of system operation is accomplished by a manual
recovery action. Software errors which disappear follow this same
pattern. However, if a software patch is required, the REPAIR
state becomes involved in a manner analogous to the situation
discussed with respect to permanent type hardware failures.

The summary table provided below the transition diagram outlines
conditions in states S1, S2, and S3, and defines the type of
recovery required for the specified conditions.

The  critical  assumptions  associated  with  the  derivation  of  the 
state  probability  equations  shown  in  Figure  D.4  are: 

Failure  and  repair  hazards  are  assumed  to  be  constant, 
which  is  equivalent  to  stating  that  individual  elements  are 
assumed  to  fail  in  accordance  with  the  negative  exponential 
distribution,  and  the  times  to  repair  are  also  exponentially 
distributed. 
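
The three-state model described above can be illustrated numerically. The sketch below is not the CONFIGURE program and does not reproduce the equations of Figure D.4; it is a minimal illustration, assuming only the constant transition rates stated in the text (1/MUT, 1/MRT, 1/MINT, split by P1 and P2), of how steady-state probabilities for the UP, REPAIR, and INTERRUPT states follow from balancing the flows between states. The function name and the sample input values are hypothetical.

```python
def three_state_probabilities(mut, mrt, mint, p1, p2):
    """Steady-state probabilities for the UP / REPAIR / INTERRUPT model.

    mut, mrt, mint: mean up, repair, and interrupt times (same time unit).
    p1: fraction of failures that go directly from UP to INTERRUPT.
    p2: fraction of interrupt events that proceed to REPAIR.
    """
    if min(mut, mrt, mint) <= 0:
        raise ValueError("mean times must be positive")
    if not (0.0 <= p1 <= 1.0 and 0.0 <= p2 <= 1.0):
        raise ValueError("p1 and p2 must lie in [0, 1]")

    # Expected time spent in each state per visit to the UP state.
    up = mut
    interrupt = p1 * mint                      # entered with probability p1
    repair = ((1.0 - p1) + p1 * p2) * mrt      # entered directly or via INTERRUPT

    total = up + interrupt + repair
    return {
        "UP": up / total,
        "REPAIR": repair / total,
        "INTERRUPT": interrupt / total,
    }


# Hypothetical inputs, in hours (illustration only).
probs = three_state_probabilities(mut=1000.0, mrt=2.0, mint=0.5, p1=0.4, p2=0.25)
for state, value in probs.items():
    print(f"{state:9s} {value:.6f}")
```

The three probabilities sum to one, and the flow out of each state equals the flow into it, which is the balance condition implied by the constant-hazard assumption.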
Figure D.4  Three-State Model (transition diagram)

D.4  DEFINITIONS

The  definitions  summarized  below  are  provided  for  reference  to  aid 
in  interpreting  computed  results. 

D.4.1  Program  Inputs 

The  following  definitions  pertain  to  Program  inputs.  Each  defi- 
nition describes  a program  variable.  The  symbol  used  is  given  in 
brackets  following  the  definitions.  In  cases  where  these  symbols 
are  reciprocals  of  the  various  transition  rates  discussed,  an 
equivalence  relationship  is  given  which  correlates  the  symbols  in 
Figure  D.2  with  the  input  data  symbols. 

- DEVICES  REQUIRED.  Minimum  number  of  identical  subsystem 
devices  required  to  be  working  for  acceptable  subsystem 
operation  (R). 

- DEVICES  AVAILABLE.  Number  of  identical  subsystem  devices 
provisioned  for  active  subsystem  operation  (N). 

- TIME BETWEEN FAILURES (PERMANENT). Time interval from an
instant when a repairable device is working to the next
permanent type device failure which requires a manual
recovery action (mean: MTBF(P) = 1/λp).

- TIME BETWEEN FAILURES (INTERMITTENT). Time interval from
an instant when a repairable device is working to the next
intermittent type device failure which requires a manual
recovery action (mean: MTBF(I) = 1/λi).

- SINGLE POINT FAILURES*. Percentage of total failures in a
redundant subsystem configuration which result in subsystem
failures (permanent or intermittent) even though an adequate
number of devices are working (SPFM = P).

- DEVICE  REPAIR  TIME.  Time  interval  from  an  instant  when 

repair  of  a device  is  initiated  to  readiness  as  an  active 
subsystem  device,  excluding  waiting  times  for  repairmen, 
spares,  etc.  (mean:  DRT  = l^p  ). 

- SINGLE  POINT  REPAIR  TIME* . Time  interval  from  an  instant 

when  repair  of  a single  point  failure  is  initiated  to 
readiness  of  associated  subsystem,  excluding  waiting  times 
for  repairmen,  spares,  etc.  (mean:  SRT  = l^PC) . 


- RECOVERY EFFICIENCY (PERMANENT). Percentage of automatic
recovery actions from permanent type failures which are
completed successfully (without manual intervention) within
a negligible period of time (RE(P) = dp).

- RECOVERY EFFICIENCY (INTERMITTENT). Percentage of automatic
recovery actions from intermittent type failures
which are completed successfully (without manual intervention)
within a negligible period of time (RE(I) = di).

- DEVICE MANUAL RECOVERY TIME. Time interval from an instant
when a system failure related to an unsuccessful automatic
recovery from a device failure (permanent or intermittent)
occurs until the system is restored to normal operation via
manual  recovery  procedures  (mean:  DMRT  = V/p) . 

- TIME BETWEEN MAINTENANCE ERRORS. Time interval from an
instant when a system recovery related to a maintenance
error is completed until the next occurrence of a maintenance
error which causes system interruption (mean: MTBME
= !/(<). 

- TIME BETWEEN PREVENTIVE MAINTENANCE ACTIONS. Time interval
from an instant when a device has been recommitted to
active operation following a scheduled preventive maintenance
action until the next scheduled preventive maintenance
action  is  due  (mean:  MTBPM  = l/^M) 

- TIME TO PERFORM PREVENTIVE MAINTENANCE. Time interval from

an  instant  when  a device  is  decommitted  from  active  oper- 
ation (mean:  MTTPM  « 1>/<m)  • 

- SYSTEM  MANUAL  RECOVERY  TIME.  Time  interval  from  an  instant 
when  a system  failure  related  to  a transient  software  or 
operator  error  occurs  until  the  system  is  restored  to 
normal  operation  via  manual  recovery  procedures  (mean: 
SMRT  = l/Js) . 


* This variable is provided for convenience in preparing initial
estimates; if desired, SPFM's can be modeled as a single non-
redundant device in series with the associated subsystem.

D.4.2  Program Outputs

The  following  six  useful  measures  of  system  performance  are  fre- 
quently encountered  in  modeling  and  analyzing  repairable,  redun- 
dant system  configurations: 

- Availability 

- Mean  Up  Time  (MUT) 

- Mean  Down  Time  (MDT) 

- Mean  Cycle  Time  (MCT) 

- Mean  Time  to  First  Failure  (MTFF) 

- Mean Time to Failure (MTTF)

Although  the  terminology  given  above  is  common  in  the  field  of 
reliability,  the  precise  meaning  of  the  terms  can  easily  become 
confused.  The  following  definitions  of  these  terms  appear  in  the 
paper by Buzacott [1]. These definitions have been extracted
almost  directly  since  the  unified  presentation  in  the  Buzacott 
paper  tends  to  relieve  much  of  the  misunderstanding  encountered 
regarding  the  various  mean  times  of  interest  and  the  concept  of 
point  and  interval  availability  in  treating  repairable,  redundant 
systems.  Only  the  MUT,  MDT,  and  interval  availability  definitions 
are  directly  applicable  to  the  outputs  of  the  DESIGN  program.  The 
remaining  definitions  are  provided  for  reference  to  further  clar- 
ify the  precise  meanings  of  MUT,  MDT,  and  interval  availability. 

D.4.2.1  System  Availability 

Let SYSTEM INITIATION be the instant when system operation begins
for the first time. The system and all of its components are
assumed to be working correctly and not to be subject to wearout
during the time interval of interest.

Let SYSTEM FAILURE be the instant when the system changes from
working to failed. Let SYSTEM REPAIR be the instant when the
system changes from failed to working.

The first group of definitions applies to the concept of
availability.

- POINT AVAILABILITY (at time t). The probability that the system
is working at time t from system initiation. It is assumed that
at time t no information is available about system failures and
system repairs during the time interval (0, t) (symbol P_w(t)).

- INTERVAL AVAILABILITY (at time t). The expected proportion of
the time interval from system initiation (time 0) to time t
during which the system is working (symbol I(t)). Hence

    I(t) = \frac{1}{t} \int_0^t P_w(u) \, du                  (D.1)

It is assumed in calculating all quantities except the mean of the
time to first failure that the system has been operating long
enough for initial effects to have died away; i.e., statistical
equilibrium (or the asymptotic behavior) has been reached. The
mean cycle time is then obtained from the ratio of the length of
some long time interval to the number of failures N in that time
interval, i.e.,

    MCT = \frac{\text{length of time interval}}{N}

and

    AV = \text{Availability} = \frac{MUT}{MCT}

Asymptotically, for large t, it can be shown that point
availability is numerically the same as interval availability. Thus

    A = \lim_{t \to \infty} P_w(t) = \lim_{t \to \infty} I(t)

This steady-state availability is used herein.
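
The relationship between point, interval, and steady-state availability can be checked numerically. The sketch below is an illustration only: it assumes a single repairable unit with exponentially distributed up and down times (a two-state model, simpler than the model used in this appendix), for which closed-form expressions for point and interval availability are well known. The function names and sample values are hypothetical.

```python
import math


def point_availability(t, mut, mdt):
    """P_w(t) for a two-state unit that starts in the working state."""
    lam, mu = 1.0 / mut, 1.0 / mdt          # failure and repair rates
    a = mu / (lam + mu)                      # steady-state availability
    return a + (1.0 - a) * math.exp(-(lam + mu) * t)


def interval_availability(t, mut, mdt):
    """I(t) = (1/t) * integral of P_w(u) du over (0, t), in closed form."""
    if t == 0:
        return 1.0
    lam, mu = 1.0 / mut, 1.0 / mdt
    a = mu / (lam + mu)
    s = lam + mu
    return a + (1.0 - a) * (1.0 - math.exp(-s * t)) / (s * t)


def steady_state_availability(mut, mdt):
    """A = MUT / MCT, with MCT = MUT + MDT."""
    return mut / (mut + mdt)


# Hypothetical values, in hours.
MUT, MDT = 100.0, 2.0
for t in (1.0, 10.0, 100.0, 10000.0):
    print(t, point_availability(t, MUT, MDT), interval_availability(t, MUT, MDT))
print("steady state:", steady_state_availability(MUT, MDT))
```

For large t both quantities approach MUT/MCT, with the interval availability converging more slowly because it averages over the initial transient.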

D.4.2.2  System  Time  Interval  Between  Failures 

The next group of definitions refers to the concepts related to the
time interval between failures. Each definition defines a random
variable. The symbol that will be used herein for the mean is
given in parentheses following the definition.

- UP TIME. Time interval from system repair to next system
failure (mean: MUT).

- DOWN TIME. Time interval from system failure to next
system repair (mean: MDT).

- CYCLE TIME. Time interval from one system failure to the
next system failure. The cycle time is the sum of an Up
Time and a Down Time (mean: MCT).

- TIME TO FIRST FAILURE. Time interval from system initiation
to the first system failure (mean: MTFF).

The first three time intervals, up time, down time, and cycle time,
apply to a system that is alternately working, failed, working,
failed, and so on. The system is said to be repaired when
sufficient, but not necessarily all, components are repaired so that
the system is working.
APPENDIX E

FMP RELIABILITY DATA BASE

This Appendix presents the predicted results of the MTBF's and
failure rates for the elements of the FMP system (see Figures E.1,
E.2, and E.3). Three sets of results are shown, the difference
being the failure rate and SECDED improvement factors used for the
LSI memory circuits.

The form of these predictions is a hierarchical structure. Each
FMP element is listed as a level 01; the parts constituting the
elements are listed as level 02. For a better defined system, the
number of levels may increase, showing the assemblies that make up
an element as level 02, the subassemblies as level 03, and the
components in the subassemblies as level 04, etc.

For each item listed in an element, a part number and description
are provided. For the FMP, hypothetical part numbers are used.
In the case of the I.C.'s, typical parts in the generic family
assumed have been selected to represent all the I.C.'s used in the
various logic functions.

The quantity of each part in the structured listing is shown. The
quantity is multiplied by the quantity of the encompassing level
item. For example, where a quantity of 2 of an assembly or
element is listed and a quantity of 3 of a particular part in an
assembly or element is required, the quantity listed for this part
will be 6.

Failure rates for each individual part and the total quantity of
that part are shown and expressed in failures per million hours
(FPMH). The aggregate of these failure rates is used to predict the
failure rate of the element and the mean time between failures
(MTBF) in hours. The failure rates used, except for the LSI
memories, have been developed from the guidelines in MIL-HDBK-217B.
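
The roll-up described above can be sketched as follows. This is an illustration, not the procedure used to produce the figures: it assumes constant part failure rates and that the failure of any listed part fails the element, so that the rates simply add. The function names and the sample failure rates are hypothetical; only the quantity example (2 x 3 = 6) comes from the text.

```python
def extended_quantity(quantity_per_parent, parent_quantity):
    """Quantity listed for a part: its quantity times that of the enclosing item."""
    return quantity_per_parent * parent_quantity


def element_failure_rate(parts):
    """Sum of (listed quantity x part failure rate), in failures per million hours."""
    return sum(quantity * rate_fpmh for quantity, rate_fpmh in parts)


def mtbf_hours(total_fpmh):
    """Mean time between failures, in hours, for a rate given in FPMH."""
    if total_fpmh <= 0:
        raise ValueError("failure rate must be positive")
    return 1.0e6 / total_fpmh


# Quantity example from the text: 2 assemblies, 3 of a given part in each.
listed = extended_quantity(3, 2)           # 6

# Hypothetical part failure rates (FPMH), for illustration only.
parts = [(listed, 0.05), (10, 0.12)]
rate = element_failure_rate(parts)
print(listed, rate, mtbf_hours(rate))
```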

While  not  used  in  this  study,  columns  for  the  spares  confidence 
level  are  shown.  When  used,  these  data  indicate  the  maximum 
number  of  repair  actions  (and  therefore  spare  elements)  required 
at a specific confidence level for a specified number of years and
duty  cycles. 
Figure E.1  FMP Reliability Data - Lower Bound

Figure E.2  FMP Reliability Data - Probable Case

Figure E.3  FMP Reliability Data - Upper Bound

APPENDIX F

SYSTEM THROUGHPUT AND UTILIZATION ANALYSIS


F.1  SUMMARY

The study of the feasibility of the NASF would be incomplete if
only the high-performance computational engine, called the Flow
Model Processor (FMP), were considered. The facility must have
sufficient support equipment so that the FMP will not be idle due
to bottlenecks elsewhere in the facility.

Chapter 2 of this report introduced the expected operational
environment. Figure F.1 shows the organization of the facility at
the level considered in this initial study of the facility.
Reference can be made back to Figures 2.1 and 2.2 to understand
the level of this model. In particular, analysis to this point
does not include the structure of the data communications,
processing and terminals local to the users. All of those
capabilities are lumped under the term "Users".

Since draft copies of the system-level operational scenarios were
not available until late in the study, some of the system-level
analysis originally planned has not been completed. The analysis,
described in more detail below, specifically considers the loading
of the Flow Model Processor, the File System and the Support
Processor. The data transfer requirements between each of these
major system components and to the Users are also considered.

The analysis shows that the system proposed during the Preliminary
Study [1, 2] would be inadequate to support the operational
scenarios provided during this feasibility study. In particular, the
support processor would have been a bottleneck as far as
computational capability is concerned, and the data transfer
requirements to and from the support processor system were
underestimated originally. Part of the excess loading of the support
processor system was alleviated near the start of this study when
the decision was made to consider the feasibility of a system where
file management was a function supported by the file system itself
rather than by the Support Processor. This analysis has shown
that support of the major formatting requirements for both
hard-copy printers and, most especially, for Computer Output to
Microfilm (COM) should be removed from the Support Processor to
either a peripheral-support processor or perhaps to the FMP itself.

F.2  MODEL  AND  ASSUMPTIONS  USED  FOR  ANALYSIS 

Figure F.1 shows the general model used for this analysis. The
analysis performed was an operational-type analysis based on NASF
operational scenarios included in the original NASF Utilization
document [3] as updated during subsequent discussions.

The data presented in the scenario was in terms of job
classifications. The following cases were used to represent the
various types of use encountered during a "typical" NASF day.

1.A  Method and Code Development using scaled down problems.

1.B  Grid Modification.

2.A  Larger code development as well as grid and result array
     generation.

2.B  Grid Generation.

3.   Simpler simulations on a large grid (such as inviscid flows
     with boundary layer correction).

4.   Typical viscous, steady flow simulations used for design,
     resulting in a single solution.

5.   Viscous, steady flow simulations requiring several solutions,
     such as design optimizations.

6.   Unsteady viscous flow simulations for design applications.

7.   Large fluid physics research simulations.

These  cases  correspond  to  the  column  headings  in  Table  F.l. 

Regardless  of  case,  each  user  has  a sequence  of  tasks  to  be  per- 
formed in  order  to  complete  his  job.  These  tasks  were  generally 
identified  in  four  major  areas: 

A.  Simulation  Program  Input 

B.  Simulation  Input  Data  Preparation 

C.  Simulation  (execution) 

D.  Output of Simulation Results

The  actual  detailed  tasks  defined  in  the  utilization  document  [3] 
were: 

(A. Simulation Program Input)

1.  Source Module Generation is the task of inputting
    "new" simulation source programs into the system.

2.  Source Module Editing is the task of editing source
    modules as required by input or compile errors.

3.  Source Module Compilation is the task of compiling
    source modules of simulation programs into object
    programs.

4.  Linking is the task of collecting all the object modules
    which are required for a simulation and cleaning up
    incomplete address binding prior to loading into the
    FMP for execution.


(B. Simulation Input Data Preparation)

5.  Configuration Generation is the task where surface coordinate
    tables are input and surface patches are computed.
    The model assumed that, in some cases, the FMP could
    compute the surface patch coefficients but that the
    processor local to the graphics stations could also
    perform this computation.

6.  Surface Grid Generation (not separately modelled but
    considered part of Task 7).

7.  Field Grid Generation is the task which computes the
    coordinates of the grid to be used during flow-field
    computations on the FMP. The model assumed that this
    task would be executed on the FMP when operator verification
    of the resulting grid was not necessary. Otherwise,
    it would be executed on the Support Processor in
    order to have prompt display of results to the operator.

8.  Gathering is the task which is used to specify the
    parameters of a particular simulation run and to begin
    staging data to the FMP.

9.  Execution is the task which runs a job on the FMP.

10. Preselected Data Display is the task which outputs data
    which had been organized during FMP execution to line
    printers, to graphics terminals and to microfilm
    printers.

11. Interactive Post-Execute Display is the task which
    supports the selective extraction of data from the FMP
    output files. The data extracted would be requested by
    and displayed to the user at a graphics console.

12. Debugging Display is the task of formatting and displaying
    (in some appropriate manner) that information saved
    by the FMP when a run aborts.

13. Restart Dumps is a subtask of Task 9 and involves taking
    a snapshot of the status of a simulation run to be used
    as a restart or initialization point on a later run.

Table F.1 summarizes the important data for evaluating the model
studied.

Table F.1  NASF Operational Scenario Data

In addition to the information provided by NASA, it was necessary
to make some assumptions during the course of the analysis. These
assumptions were based in part on experience and in part on
judgment. Table F.2 summarizes these assumptions.


TABLE F.2

Significant Assumptions

0.2      Fraction of Users who use the Support Processor to do
         data entry & editing

40       Average length (chars) of source statements

50       Average length (chars) of control messages

0.2      Fraction of a module fixed or modified on each edit

4000     Average additional compiler output (characters) over
         and above the source statements per module

8        Number of words of object program per line of source

0.25     Fraction of modules with bugs which are waiting change
         and which will be batched as far as BINDING (linking)

0.1      Fraction of edited program codes which must be completely
         bound or linked (others will be replacement bound)

0.5      Fraction of solution parameters (of an earlier run)
         modified to set up the next run

8        Number of characters out of FORMATTER for each word in
         (used to format printouts & COM)

6        Number of 8 bit characters per word (6 x 8 = 48 bits)

1.25 x 10^[exponent illegible]
         Max size of archive (characters)

0.20     Fraction of file access from active data

0.70     Fraction of file access from long-term data

0.10     Fraction of file access from archive data


F.3  ANALYSIS


The analysis first defined the sequence of events required to
implement each task. The relationship of system resources during
each task was then charted based on an approach suggested by Prof.
Anatol Holt (Boston University). These charts or diagrams are
intended to separate spatial relationships and temporal
relationships. Figure F.2 shows the resource-relationship charts for
Tasks 1 through 5 and 7, respectively. The interpretation of these
charts is straightforward. The various NASF resources, sometimes
including equipment local to the users, are shown left to right
across the chart. The sequence of events required to complete a
task is represented from the top to the bottom of the chart. For
example, the first chart of Figure F.2 is for Task 1 - Source
Module Generation. The sequence of events shown is:


Create  File 
Enter  Records 
Save  File 
Create  File 
Save  Working  File 


The  first  Create  File  event  involves  the  user,  his  terminal,  data 
comm  controls,  the  operating  system  on  the  support  processor  and 
working  files  (in  the  support  processor).  Thus,  the  system 
resources  which  interact  to  implement  each  event  of  the  task  are 
delineated.  The  second  event,  Enter  Records,  involves  a bulk  move 
of  data.  This  type  of  interaction  is  shown  with  the  curved 
corners.  This  notation  allows  the  natural  flow  of  data,  here  from 
the  user  to  the  working  file,  to  be  shown  clearly.  Each  resource 
involved  in  an  event  commits  something  of  that  resource  to  the 
event, whether it be space (storage) or time. These commitments
were the point of the analysis performed. Note that the charts
shown contain more information than was utilized in the analysis
to date. In particular, the processing capabilities and
communications local to a user have not yet been studied.


After the resource-relationship charts were prepared, they were
used as templates to prepare a straightforward program which
would collect the operational scenario data in the manner
described by the charts in order to generate the results. Thus the
charts could be used to identify how many control messages moved
and what data transferred between elements of the model. The
NASA-provided data was used to identify the average frequency of a
task, the amount of data involved, and the processing time
required.

Figure F.2  NASF Resource Relationships (charts include Edit of
source program; Task 3 - Source Module Compilation; Task 7 - Flow
Field Grid Generation, A - Baseline - FMP; Task 9 - FMP Execution;
Task 10 - Preselected Data Display, A - Print)

The program developed to support the analysis allowed
specification of the parameters of the support processor. Since the
number of CPU's needed as part of the support processor system was
of interest, the capabilities of a particular single processor
were defined. The analysis then showed the support processor
loading in terms of that one processor. From that analysis, the
number of processors required could be determined. For example,
if the average load was determined to be 1.6 processor-hours for
each hour, then at least two processors are required.
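
The sizing rule in the example above amounts to rounding the average load up to a whole number of processors. A minimal sketch follows; it assumes ideal multiprocessing with no contention or overhead, which is an optimistic lower bound. The function names are hypothetical.

```python
import math


def processors_required(cpu_hours_per_hour):
    """Smallest whole number of processors that covers the average load.

    Assumes ideal load sharing, so the result is a lower bound;
    at least one processor is always returned.
    """
    if cpu_hours_per_hour < 0:
        raise ValueError("load must be non-negative")
    return max(1, math.ceil(cpu_hours_per_hour))


def average_utilization(cpu_hours_per_hour, processors):
    """Fraction of each processor's time committed, on average."""
    return cpu_hours_per_hour / processors


n = processors_required(1.6)               # the example from the text
print(n, average_utilization(1.6, n))
```

Because only average loads are considered, a design based on this count leaves no margin for peak rates.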

Table F.3 summarizes the characteristics of three support
processors considered during the analysis. The processors are
identified as "A", "B", and "C". This form of identification was
chosen since none of the data has been completely verified on any
processor. In general, the processors are characterized as

A  B7700
B  B7800
C  Future Processor

The  data  concerning  editing,  compiling,  and  formatting  on  proces- 
sor A is  based  on  benchmarks  run  on  a mix  of  FORTRAN  programs. 
The  other  values  are  estimates  based  on  best  knowledge  and  judg- 
ment. 

In  addition,  where  it  seemed  appropriate  during  the  evaluation, 
some  modifications  were  made  to  the  NASA  supplied  operational 
scenarios. In particular, the analysis was performed for some
cases without the COM output (Task 10C) in an attempt to identify
the  impact  of  that  large  amount  of  output. 

The  analyzer  currently  generates  results  in  terms  of  daily  average 
loads.  Hand  reduction  of  the  results  from  the  analyzer  is  used  to 
generate  hourly  average  loads. 

F.4  RESULTS 

Before  considering  the  results,  a WARNING  must  be  stated.  The 
analysis  to  date  only  considers  average  data  rate  and  processor 
time  requirements.  This  would  only  be  true  under  conditions  of 
optimum  system  balance  and  concurrency.  Factors  which  are  needed 
to  predict  the  peak  rates  which  an  eventual  design  must  consider 
have  not  been  included. 

The analysis considered three major factors of the NASF system
model. First, the data transfer requirements for transfers
between each of the components of the model in Figure F.1 were
determined. Then the amount of file level activity was
determined in order to estimate the processor required to support the
file system. Finally, the amount of Support Processor and FMP
processing time was determined using the assumptions previously
stated.

TABLE F.3

SUPPORT PROCESSOR CHARACTERIZATION

                                               A        B        C

Edit Time (Sec/Stmt)                          .01      .0067    .001
Compile Time (Sec/Stmt)                       .007     .0047    .001
Compile Overhead (Sec/Module)                 0.1      .067     .013
Linking Time (Sec/Object Word)                .0007    .00047   .0001
Linking Overhead (Sec/Code)                   .01      .0067    .001
Grid Generation Rate (Sec/Grid Element)       .001     .00067   .0001
Operating System Overhead (Sec/Task)          0.0      .067     .0133
Formatting Rate (Sec/Word Formatted)          0.0006   .0004    .0001
Output Selection Rate (Sec/Point Selected)    0.01     .0067    .0013

F.4.1  Processor Loading

Table F.3 summarizes the basic characteristics of the processors
considered in this analysis. The data in this table concerning
output formatting is based on the normal interpretive execution of
a formatter package which is driven by the FORMAT statements. One
of the major reasons such a system is interpretive is the
possibility of variable format statements. However, most of the
applications observed (both the aero and weather codes among others)
have rather straightforward, fixed formatting. The improvement in
formatting time that could occur if the formatter were compiled and
executed per the statements given rather than interpreted was
assumed to be a factor of 5. Each of the processors of Table F.3
was considered under both the standard scenario and under a
scenario with no COM activity (no Task 10C). In addition, all
cases were studied with both the existing means of formatting and
with the hypothesized improvements. Table F.4 summarizes the
results in terms of support processor loading.

Note that processor "B" seems to be committed to 9.5 CPU hours per
hour when the standard scenario (interpretive formatting and COM
output) is considered. A 10-processor system would satisfy such a
requirement assuming a better than optimum multiprocessing system.
If the COM formatting task is off-loaded, then only .88 CPU hours
per hour are committed.
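As a small worked check of the figures quoted above, the following sketch converts a committed load in CPU-hours per hour into a minimum processor count. It assumes ideal multiprocessing with no overhead, which is the optimistic case noted in the text; the function name `processors_needed` is illustrative and is not part of the study.

```python
import math


def processors_needed(cpu_hours_per_hour: float) -> int:
    """Smallest whole number of processors whose combined capacity covers
    the load, assuming ideal (overhead-free) multiprocessing."""
    return math.ceil(cpu_hours_per_hour)


# Daily-average loads for processor B with interpretive formatting (Table F.4).
load_with_com = 9.5
load_without_com = 0.88

count_with_com = processors_needed(load_with_com)
count_without_com = processors_needed(load_without_com)

print(count_with_com, count_without_com)
```

Under these assumptions the standard scenario needs ten processors, and the scenario without COM formatting fits within one.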

Now consider how this load is distributed through the operational
day. Figure F.3 shows a distribution of jobs over the day. This
distribution is slightly simplified from the scenario given. In
this case, 22 hours of operations were assumed. The loading shown
is such that no workload case which represents long jobs overlaps
with a workload case which represents short jobs. When this
distribution of jobs is assumed, the average SPS processor loading
per shift can be determined. Figure F.4 shows this evaluation for
processor B with no COM formatting.

Figure F.4 shows the same schedule of job execution as Figure F.3.
The columns on the right side of the figure show the total CPU
load determined for each case (output of the analyzer). The load
for each case was then averaged over the shift in which it is
scheduled to determine the average CPU loading over the shift.
The loading is shown as CPU-hours per hour. Note that in Table
F.4, .88 CPU-hours per hour is the average load over a day. However,
Figure F.4 shows that when the load is distributed by shift,
the peak load is 1.64 CPU-hours per hour.

Even this rate is optimistic since the actual loading of an interactive
system is not uniformly distributed across the day. Loading
will tend to have peaks and valleys. If there is a variation
of 30% from the average, then the peak rate would be 2.13
processor-hours/hour. A trade-off between response time and
system complexity and cost must now be considered. By limiting
the system to two processors of this sort, these processors would
be busy most of the time, depending on load peaks which are not
under the control of the operations staff.
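The peak-rate arithmetic above can be reproduced with the short sketch below. It simply scales the peak shift average by the assumed 30% variation; the helper name `peak_load` is illustrative only.

```python
def peak_load(shift_average: float, variation: float) -> float:
    """Scale a shift-average load by a fractional variation above the average."""
    return shift_average * (1.0 + variation)


# Peak shift average for processor B without COM formatting (Figure F.4).
peak = peak_load(1.64, 0.30)

# Fraction of a two-processor system that this peak would occupy.
two_processor_utilization = peak / 2.0

print(round(peak, 2), round(two_processor_utilization, 2))
```

The result is about 2.13 processor-hours per hour, slightly more than two processors can deliver, which is why a two-processor system would be saturated at such peaks.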


Table F.4

Daily Average: Support Processing CPU Hours/Hour

                                        Scenario
PROCESSOR    FORMATTING           With COM     Without COM

    A        Interpretive           14.2          1.31
             Direct Execute      [illegible]      1.12

    B        Interpretive            9.5           .88
             Direct Execute          3.3           .76

    C        Interpretive            2.8           .19
             Direct Execute           .7           .15
The analysis described above is shown in Figure F.5 for processor
(a future processor). The case shown includes the COM output
load and assumes the improved formatting rate. Unfortunately, the
analysis of loading due to COM formatting is only an approximation.
The actual use planned is to produce graphics images on the COM
device. A sequence of many frames would become a movie showing
the dynamics of a model. Since no information concerning graphics
formatting was available at the time, the approximation was made
that COM formatting would be a task of the same magnitude as
formatting printed alphanumeric output. This approximation may be
somewhat optimistic.

Figure F.5 shows that peak CPU loads on the support processor
occur during second and third shift. Note that the case shown in
Figure F.4 has the main load during prime shift. The difference
is the COM formatting, which peaks in Cases 6 and 7 (see Task IOC
in Table F.1). The loading shown in Figure F.5 indicates additional
processor time available during the prime shift (8 am EST to
5 pm PST).

One variation of the scenario was tested to see the impact on
support processor loading. The variation was to change the fraction
of editing done on the support processor from 0.2 to 0.8.
The increase in editing load brings the CPU loading from .315 to
.317 CPU hours per hour (average over the shift).
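To put the editing variation in proportion, the sketch below computes the relative change in CPU loading from the two figures quoted above. The function name `relative_change` is illustrative only.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after`."""
    return (after - before) / before


# Shift-average CPU loading before and after raising the editing fraction
# on the support processor from 0.2 to 0.8.
change = relative_change(0.315, 0.317)

print(f"{change:.2%}")
```

A fourfold increase in the editing fraction raises the loading by well under one percent, so editing is a minor contributor to support processor load in this scenario.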

F.4.2 File System Activity

In order to begin to evaluate the file system in detail, the major
functional demands were determined, based upon the previously
described scenarios. Two types of demands were considered: data
transfer and control.

Data transfer demands were considered based on each of the interconnection
paths in Figure F.1. The data transfer rates, averaged
by day and by shift (as defined by Figure F.3), are shown in Table
F.5. Here again, the rates are averaged either over the day or a
shift and do not consider peak loading. It is interesting to note
that if the Support Processor is relieved of the COM formatting
task, the rates for Support Processor - File System (corresponding
to the first line of Table F.5) become 7.711K char/sec average
over the day, and the three shift rates become 56.90, 15.38K, and
44.5 char/sec respectively, averaged uniformly over the shift.
Again, the major reduction is during non-prime time.
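The relation between the shift rates and the daily averages can be checked with the sketch below. It rests on two assumptions that are not stated explicitly in the text: that the three shift columns of Table F.5 last 3, 12, and 7 hours (12M-3am, 5am-5pm, 5pm-12M, which together make up the 22 operating hours), and that the daily average is taken over a full 24-hour day. The names `SHIFT_HOURS` and `daily_average` are illustrative.

```python
# Assumed shift lengths in hours: 12M-3am, 5am-5pm, 5pm-12M (22 operating hours).
SHIFT_HOURS = (3.0, 12.0, 7.0)
DAY_HOURS = 24.0


def daily_average(shift_rates):
    """Time-weighted average of per-shift rates (char/sec) over a 24-hour day."""
    total = sum(rate * hours for rate, hours in zip(shift_rates, SHIFT_HOURS))
    return total / DAY_HOURS


# Support Processor - File System, with COM (first line of Table F.5).
sp_fs_with_com = daily_average((83388.0, 16678.0, 35937.0))

# Support Processor - File System, without COM (rates quoted in the text).
sp_fs_without_com = daily_average((56.90, 15380.0, 44.5))

# File System - FMP, with COM (last line of Table F.5).
fs_fmp = daily_average((294770.0, 210032.0, 73770.0))

print(round(sp_fs_with_com), round(sp_fs_without_com), round(fs_fmp))
```

Under these assumptions the three results come to roughly 29.24K, 7.71K, and 163.4K char/sec, in agreement with the daily averages quoted for those paths.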

Control functions were considered with respect to file activity.
Based on the NASA-supplied scenarios, the number of file
creations, file deletions, and file accesses per day were
determined for the active high-speed access files, for the long-term,
somewhat slower access files, and for the slow access archive
files. In addition, the number of times that an active file's
contents were replaced by new contents was determined. These
results are shown in Table F.6.
TABLE F.5

NASF DATA TRANSFER REQUIREMENTS
(with COM)

                                                 RATE (Char/Sec)
                                     Daily             Hourly Average
                                     Average    12M-3am     5am-5pm     5pm-12M

Support Processor - File System      29,240     83.388K     16.678K     35.937K
Support Processor - FMP                .050       .02         .08         .02
Support Processor - Users             4,453      .228K      8.125K       .187K
File System - Users                  24,260     3.002K      45.9K       1.554K
File System - FMP                   163,400   294.770K    210.032K     73.770K

TABLE F.6

NASF FILE SYSTEM CONTROL ACTIVITY PER DAY

                                FILE TYPE
FILE ACTIVITY         ACTIVE    LONGTERM    ARCHIVE

Files Created          2483       1127       627.3
Files Deleted          2483       1127       627.3
Files Accessed        19810       827.7      118.3
Files Replaced         1302        --         --

F.5 FUTURE WORK

In order to make the system analysis more accurate, benchmarks
must be developed which represent the work to be supported by the
Support Processor. The magnitude of editing, compiling, and
linking can be determined given the existing codes and given the
assumption that the compilers and linkers developed to support the
FMP would be of the same complexity as those of the existing codes
on existing machines. Benchmarks can be developed to study the
actual formatting rate and the SPS commitment to task management
and I/O. More accurate estimates of the grid generation task and
the interactive graphics support tasks need to be made.

All system-level modelling must be operationally based. That is,
the results of any system-level modelling should be easily
verified by direct observation of an actual system. If this
guideline is followed, verification of the models will become
straightforward.
APPENDIX G

FMP FORTRAN EXAMPLES WITH ORIGINAL FORTRAN SOURCE

The listings which follow in this appendix are examples of FMP
FORTRAN codes with their original FORTRAN source code. The FMP
FORTRAN versions are all mentioned in Appendix A with regard to
the analysis and simulation activities of the study. In some
cases (e.g., LINKHO and COMP2 of the GISS weather codes, and BTRI
of the Implicit Aero Code), the resulting FMP FORTRAN is incomplete.
These cases were taken only far enough to be able to
generate an accurate timing, since a functional simulation was not
required at this time.

The listings provided are identified in the table below:

Figure    Application      Identification

G.1       Implicit Aero    SMOOTH
G.2       Implicit Aero    BTRI
G.3       Explicit Aero    OUTER
G.4       Explicit Aero    TURBDA
G.5       Explicit Aero    LX
G.6       GISS Weather     COMP2 Section
G.7       GISS Weather     LINKHO (part of COMP3)


Original FORTRAN

      SUBROUTINE SMOOTH
      COMMON/BASE/NMAX,JMAX,KMAX,LMAX,JM,KM,LM,DT,GAMMA,GAMI,SMU,FSMACH
     1 ,DX1,DY1,DZ1,ND,ND2,FV(5),FD(5),HD,ALP,GD,OMEGA,HDX,HDY,HDZ
     2,RM,CNBR,PI,ITR,INVISC,LAMIN,NP,INT1,INT2,INT3
      COMMON/GEO/NB1,NB2,RFRONT,RMAX,XR,XMAX,DRAD,DXC
      COMMON/READ/IREAD,IWRIT,NGRI
      COMMON/VIS/RE,PR,RMUE,RK
      COMMON/VARS/Q(720,6,30)
      COMMON/VAR0/S(720,5,30)
      COMMON/VAR1/X(720,30),Y(720,30),Z(720,30)
      COMMON /VAR3/P(120,30),XX(60,4),YY(60,4),ZZ(60,4)
      LEVEL 2,Q,S,X,Y,Z
      COMMON /FLSH/DX2,DY2,DZ2
      COMMON/IDX/LMM,KMM,JMM
C
C     FOURTH-ORDER SMOOTHING
C     SECOND ORDER AT THE BOUNDARIES
C
      SMI = SMU*.5
      DO 10 L = 2,LM
      DO 10 K = 2,KM
      KL = (L-1)*ND+K
      DO 12 J = 3,JMM
      DO 12 N = 1,5
   12 S(KL,N,J) = S(KL,N,J)-SMU*(Q(KL,N,J-2)*Q(KL,6,J-2)+Q(KL,N,J+2)*
     1 Q(KL,6,J+2)+6.*Q(KL,N,J)*Q(KL,6,J)-4.*Q(KL,N,J+1)*Q(KL,6,J+1)
     2 -4.*Q(KL,N,J-1)*Q(KL,6,J-1))/Q(KL,6,J)
      DO 10 N = 1,5
      S(KL,N,2) = S(KL,N,2)+SMI*(Q(KL,N,3)*Q(KL,6,3)-2.*Q(KL,N,2)*
     1 Q(KL,6,2)+Q(KL,N,1)*Q(KL,6,1))/Q(KL,6,2)
      S(KL,N,JM) = S(KL,N,JM)+SMI*(Q(KL,N,JMAX)*Q(KL,6,JMAX)-2.*Q(KL,N,
     1 JM)*Q(KL,6,JM)+Q(KL,N,JMM)*Q(KL,6,JMM))/Q(KL,6,JM)
   10 CONTINUE
C
C
      DO 20 J = 2,JM
      DO 20 L = 2,LM
      DO 22 K = 3,KMM
      KL = (L-1)*ND+K
      DO 22 N = 1,5
   22 S(KL,N,J) = S(KL,N,J)-SMU*(Q(KL+2,N,J)*Q(KL+2,6,J)+Q(KL-2,N,J)*

Figure G.1 Implicit Aero - SMOOTH
100

EOO 

300 

HOO 

mo 

420 
430 
440 
700  C 
1000 
120  0 
1300 
1400 
1500  1 


SUBROUTINE  SMOOTH 

COMMDN/BftSE/NMftX, JMRX, KMflX, UMfiX, DT,eflMMft, a«MI , FSMRCH, 

1 OXi,DYi,DZl,rM<5),FD<5>,HO,BLP,SO,OMEa«,HOX,HOY,H02,RM, 

2 CMBR, PI , ITR, NP, INTI, INT2, INT3 
ODMftIN  /MDOEL/<  J»l,l005  Ksl,50;  L*=i,200 

RE6I0N  /THREEO<  < J:=2,  JMfiX-1  ) , ( K«2  , KMflX-1  ) , < L«2  , LMfiX-*i  ) ) / 

X t:  /MDOEL<  J,  K,  L>/ 

INflLU  /MODEL/  iSC  6 ) , S < 5 ) , SS  , CT  C 5 ) , T1 , T2  , T3  , T4 
4th  order  SMOOTHINS,  2d  order  RT  the  BOUNDRRIES 
DORLL  /THREED<U, K, L^/  } USING  Q,  S,  SHU 
TEMP  = 1, /Q<U, K, L, &) 

DO  i Wsl,5 

CT<N>=  J, K, L, N^XS< J, K, L, 

CONTINUE 


1600 

IF  <J,E©.2  .DR,  J.E©.JMftX“l> 

THEN 

1700 

Ti  s Q<Jrl,K,L,6> 

iSOO 

T2  “ a<U-i,K,L,6> 

1900 

DD  2 Nsi,5 

2000 

SS  s S(J,K,L,H.i  i*  0 , 5xSMUX<G<  JtI 

2100 

A , K,  L,  W>AT2)3I(TEMP 

2200  2 

CONTINUE 

2300 

ELSE 

240  0 

DO  3 Nsl,5 

2500 

T1»Q( J+2, K, L, 6> 

2600 

T2sG< J-2, K, L, 6> 

2700 

T3set^  JtI,  K,  L,  6> 

2300 

T4sQ( J-x, K, L, 6> 

2900 

SS  n S<J,K,L,N>  t SMUx(a<Uir 

2,  K,  L 

3000 

1 4.x<®<Jfl,K,L,N;xT3  + 

J-i 

3100  3 CONTINUE 

3200  ENDIF 

3300  NEXTDO 

3400  IF  (K.EQ.2  ,DR,  K.EG.KMRX-1)  THEN 

3500  T1SQ< J, KtI, L, 

3600  T2sD< J, K-i, L, 6> 

3700  DO  4 Nsl,5 

3300  SS  a SS  + 0 , 5xSMUX<G<U, K+X, L, N>XTi  + S ( J , K- A , U , H ^ XT2  - 

3900  i 2,xCT(N) )xtEMP 

4000  4 CONTINUE 


Figure G.1 Implicit Aero - SMOOTH (Cont'd)
     1 Q(KL-2,6,J)+6.*Q(KL,N,J)*Q(KL,6,J)-4.*Q(KL+1,N,J)*Q(KL+1,6,J)
     2 -4.*Q(KL-1,N,J)*Q(KL-1,6,J))/Q(KL,6,J)
      DO 20 N = 1,5
      KL = (L-1)*ND+2
      S(KL,N,J) = S(KL,N,J)+SMI*(Q(KL+1,N,J)*Q(KL+1,6,J)-2.*Q(KL,N,J)
     1 *Q(KL,6,J)+Q(KL-1,N,J)*Q(KL-1,6,J))/Q(KL,6,J)
      KL = (L-1)*ND+KM
      S(KL,N,J) = S(KL,N,J)+SMI*(Q(KL+1,N,J)*Q(KL+1,6,J)-2.*Q(KL,N,J)
     1 *Q(KL,6,J)+Q(KL-1,N,J)*Q(KL-1,6,J))/Q(KL,6,J)
   20 CONTINUE
C
C
      DO 30 J = 2,JM
      DO 30 K = 2,KM
      DO 32 L = 3,LMM
      KL = (L-1)*ND+K
      I1 = KL+ND
      I2 = KL+2*ND
      I3 = KL-ND
      I4 = KL-2*ND
      DO 32 N = 1,5
   32 S(KL,N,J) = S(KL,N,J)-SMU*(Q(I2,N,J)*Q(I2,6,J)+Q(I4,N,J)*
     1 Q(I4,6,J)+6.*Q(KL,N,J)*Q(KL,6,J)-4.*Q(I1,N,J)*Q(I1,6,J)-4.*
     2 Q(I3,N,J)*Q(I3,6,J))/Q(KL,6,J)
      DO 30 N = 1,5
      KL = ND+K
      I1 = KL+ND
      I2 = KL-ND
      S(KL,N,J) = S(KL,N,J)+SMI*(Q(I1,N,J)*Q(I1,6,J)-2.*Q(KL,N,J)*
     1 Q(KL,6,J)+Q(I2,N,J)*Q(I2,6,J))/Q(KL,6,J)
      KL = LMM*ND+K
      I1 = KL+ND
      I2 = KL-ND
      S(KL,N,J) = S(KL,N,J)+SMI*(Q(I1,N,J)*Q(I1,6,J)-2.*Q(KL,N,J)*
     1 Q(KL,6,J)+Q(I2,N,J)*Q(I2,6,J))/Q(KL,6,J)
   30 CONTINUE
      RETURN
      END

Figure G.1 Implicit Aero - SMOOTH (Cont'd)
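The smoothing stencil used in the original SMOOTH listing can be illustrated in one dimension. The sketch below is not the FMP or the original code: it is a simplified single-line version, written in Python, that applies the fourth-difference operator at interior points and the second-difference operator at the points next to each boundary. The names `smooth_1d`, `s`, `q`, `jac`, and `smu` are illustrative; `jac` stands in for the sixth component of Q, by which each term is weighted in the listing.

```python
def smooth_1d(s, q, jac, smu):
    """One-dimensional sketch of fourth-order smoothing with second-order
    smoothing next to the boundaries.

    s   : right-hand-side values to be smoothed (list of floats)
    q   : solution values along one grid line
    jac : weighting values along the same line (Q(...,6,...) in the listing)
    smu : smoothing coefficient
    Requires at least five points; the two end points are left unchanged.
    """
    n = len(s)
    w = [q[j] * jac[j] for j in range(n)]   # weighted solution values
    out = list(s)
    smi = 0.5 * smu

    # Interior points: subtract the scaled fourth difference.
    for j in range(2, n - 2):
        fourth = w[j - 2] + w[j + 2] + 6.0 * w[j] - 4.0 * w[j + 1] - 4.0 * w[j - 1]
        out[j] = s[j] - smu * fourth / jac[j]

    # Points adjacent to each boundary: add the scaled second difference.
    for j in (1, n - 2):
        second = w[j + 1] - 2.0 * w[j] + w[j - 1]
        out[j] = s[j] + smi * second / jac[j]

    return out


# A single spike in q is damped at its own location and spread to its neighbours.
spike = smooth_1d([0.0] * 7, [0, 0, 0, 1, 0, 0, 0], [1.0] * 7, 0.1)
# A linear profile has zero second and fourth differences, so nothing changes.
linear = smooth_1d([0.0] * 7, [0, 1, 2, 3, 4, 5, 6], [1.0] * 7, 0.1)
```

The original routine applies the same pattern three times, once along each of the J, K, and L directions, using the packed index KL to address the K and L directions.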
4100

4200 
4300 
440  0 
4500 
4600 
4700 
4800 
4900  5 
5000 
5100 
5200 
5300 
5400 
5500 
5600 
5700 
5800  6 
5900 
6000 
6100 
6200 
6300 
6400 
6500 
6600 
6700 
6800  7 
6900 
7000 
7200 
7300 


cLse 

Kt2,  L,  6> 

T2»6i<  K-2,  L,  6> 

73»«( J, Kfl, L, 6> 

V4c«<  J, K-X, L, 6> 

DD  5 N»l,5 

:>S«SS  + SMU)K(«<  J,  Kf£,  L,  N>«T1  + 9 t J , K-2 , L , H ;kT2  + 

1 4,  A<(S(U,  K + 15U,  N^kT3  + Ci  ( J,  K-x,  L , H > WT4  ) - 6 , McCT  < N ) ) jicTEMP 

CONTINUE 
ENOir 
NEXTDO 

IF  (L,E®,2  ,DR,  L.EQ.UMfiX-l)  THEN 
Tl  = CK  J, K, U + 1, 6> 

T2=Q( J, K, L-1, 6> 

DO  6 N»1,5 

S<J,K,L,N^  s <>S  + 0 , 5)«cSMU:*((€i<  J,  K,  Lrl,  N)5«CT1  + 

J,  K,  L-X,  N>McT2  - 2,  «CT(N>  ) ATEMP 

CONTINUE 

ELSE 

Tl  = <a<U,K,Lf£,6) 
t2  = q(j,k,u-2,6) 

T3  S CK  J,  K,  L*rX,  6> 

T4  t:  «K  J,  K,  L-X,  6) 

DO  7 Hss1,5 

S<J,K,L,N>  = SS  + SMUX<Q<U, K, Lt£, N^«T1  + Q < J , K , L-2 , N > AT2+ 

X 4,y£6KJ,K,Li‘l,N^ycT3  + 4.  k«<  K,  U-X,  H > AT4 

E - 6,  j«cCT<N)  >3«tTEMP 

CONTINUE 
ENDIF 

ENDDO  7THREED/  J QIUINQ  S 
RETURN 
END 
^FILE

H2200 

42301) 

42400 

42500 

42600 

42700 

42600 

42900 

43000 

43J.00 

43200 

43300 

43400 

43500 

43600 

43700 

43800 

43900 

44000 

44i00 

44200 

44300 

44400 

44500 

44600 

44700 

44800 

44900 

45000 

45100 

45200 

45300 

45400 

45500 

45600 

45700 

45800 

45900 

46000 

46100 

46200 

46300 

46400 

46500 

46600 

46700 


<Fi000015450SflM; IMPUICXT/B7RI  ON  NSS 
SUBROUTINE  B TR I < I Lft , I UR ; 

COMMON/BTRID/R<60 ,5,5>,B<60,5,5),C(60,5,5>,D<60,5,5),F<60,5) 
DIMENSION  H\,5,5> 

RERL  tli,L2i,L22,u3i,L32,L33,L4i,L42,U43,L44,L51,L52,L53,L54,L55 

ILsIUfl 

XU«IUR 

.:s»iu+i 

ieciu-:l 

C INSERT  UUOEC 

Liisl,/£( IL, i, 1; 

L21=B(IL,2,x> 

Ux2s:B(IL,i,2>«l-ll 

L22stl,/<e<IL,2,£>-L2x«Ul2) 

Ui3s8(  IL,  i, 

Ux4  s B( IL, i, 4>KLii 

UiSsBClU, i,5>ALil 

L31-B( lU, 3, l; 

l32=B*  IL,3,2>-u31aua2 

u23=<B< lU, 2,3>-L2iAUi3>AL22 

l33s1, /<E( IL, 3, 3>-Ui3ALii-u23AL32> 

U24«<B<IU,£,4>-t-2XRUl4)RL2£ 

u25«<8<:L,2,5>-L2xAUi5)5icL22 

L4isB<IL,4,i> 

L42sB<  IU,4,2>**L4x)^Ul2 

l43sB(IL,  4, 3>-i.4lAUl3-L42)«cU23 

u34:2<B<  IL,  S,4  >-l3xaU14-L3£AU£4)  AL33 

l443X./<B<  IL,  4, 4>-Ul4KU4x-U24AL42~U343«tL43) 

u35s(B(  IL,3,5>-L3iRUx5~L32:KU25>AL33 

l5Xs8( lU, 5, i> 

l52sB(IL, 5,2>-l5X«UX2 

l53sB(IL,5,3>-l5Xaux3-u52kU23 

l54sB(IL,  5, 4>-l5X)«cUX4-l5Eau24-l53aU34 

u45s<B< IL,4,5>-L4XAUX5-U42au25-u43iKU35)AL44 

l55sX, /<B< IL, 5, 5>-l51AUX5-l52au25-l53Au35-u54aU45) 

C COMPUTE  LITTLE  R S 

Dl=LilAF(IL,X; 
d2-L22a(F( IL, 2^“L2X«D1) 
d3=l33a<F<IL,3;-L3X>cOX-l32k02> 
d4su44a<F< IL, 4;-l4XADX~L42a02-L43xD3) 

05=l55a<F(  IL,  S';-l5XA0X“L52adE-l55ad3**l54A04) 

C COMPUTE  Bie  R S 

F< IL, 5>=d5 
F<  IL)  4 ;isD4-U45KD5 
F<IL,3>a03-U34wF( IL,4>-U35ad5 
F<IL,2>~D2-u23aF< IL,3>-U24nF<IL,4^-U25A05 
F(  IL,  i>«DX-'UX2AF(IL,2)-UX3xFaL,3>-Ui4AF<  IL,4;-U15AD5 


Figure G.2 Implicit Aero - BTRI
SUBROUTINE BTRI(IUA)


ftSSUME  STARTING  INDEX 


COHMDN  /BTRIO/  ft<IUR,5,5),  B<IUfl,5,5),  C<IUft,5,5), 
i D<IUA,5,5>,  r(IUft,i5> 

DIMENSION  HkS,5'j 


i70 

I MRU* 

CIT  RERLCLj 

i80 

C 

190 

C 

INSERT  LUDEC  <SIMI 

c 

£10 

Lil  - 

i,/Ba,i,i> 

££0 

L££  := 

i./ea,£,£; 

£30 

L33  « 

£H0 

L44  = 

i,/B‘(lj4}4) 

£50 

L55  = 

1./B<1,5,5> 

260 

c 

£70 

c 

COMPUTE  LITTLE  R‘: 

£80 

c 

THIS 

PASS,  COHPUTI 

£90 

c 

30  0 

FU, 

5>  y l55 

310 

F<i, 

4)  » L44 

3£0 

FU, 

3>  a L33 

330 

F<i, 

£>  « L££ 

340 

Fa, 

1)  Lil 

Figure G.2 Implicit Aero - BTRI (Cont'd)
46800  C

46900 

47000 

47100 

47200 

47300 

47400 

47500 

47600 

47700 

47300 

47900  12 

43000 

43100  C 

43200 

43300  14 

43400 

43500  C 

43600 

43700 

43800  11 

43900 

49000  C 

49100 

49200 

49300 

49400 

49500 

49600 

49700 

49300 

49900 

50000 

50100 

50200 

50300 

50400 

50500 

50600 

50700 

50300 

50900 

51000 

5x100 


COMPUTE  C PRIME  FDR  FIRST  ROW 
DO  12  M«X,5 
PlsLllAC<lL,l,M/ 

D2  = L22x<C<  IL,  2,  M>-Luxj^Dl) 

03=l33a(C( il, 3, M>-l31«D1-l32ad2) 

D4ul44a<C(  IL,  4,  M>-L4xA01*-L42aD2-u43xD3> 
05i=L55A<C<IL,5,M;-L5lADl~L52x02-L533«tD3-L54A04) 

B(IL, 5, M>=d5 

IL,  4,  H >=d4*-U45aD5 

e(lU,3,tU  - 03-U34«B< IL,4,M^-u35a05 
B<1L,2,m>  - d2-‘.  23aB(  IL,  3,  M>~u24aB(  IL,  4,  m>-u25ad5 
B(  IL,  X,  M>  K 0X-Ux2aB(  IL,  2,  M/~Ux3aBOL,  3,  M>-U14A8<  lU,  4,  IO-U15a05 
DO  x3  I=IS, IE 
COMPUTE  B PRIMEABI6R 
DO  14  Nt:l,5 

F<  I , N/:=F(  I , I , N,  i)AF(  I-i,  x^-rt<  I , N,  2>«F<  I-l,£>-fl<I  , N,  3>AF<  I-i,  3 

I ,N, 4;xF( I-x, 4>-fl<I,N, 5>AF< I-x, 5> 

COMPUTE  B PRIME 
DO  11  Nt:X,5 
DO  11  M»l,5 

H(N,Mj=BU,W,M;-H<I,N,X;KBCI-l,i,M>-R(I>N,2>AB<I-i,2,M;-RU,N,3>A 
xB( I-X, 3, M;-RO , N, 4 I-x, 4, M>-rt< I , M, 5> aB( I-i, 5, M; 

INSERT  LUDEC  fiSAIN 
Lll  = l , /H  V X, X > 

U2l:=H\;2,  i; 

U12=HC1, 2>AL11 

u22=X,  / kH<2, 2 *j-l2iaux2) 

Ux3=:H<X,  3>  AUXl 

U14=H<1, 4> ALli 

Ux5  = H Cl, 5> AUii 

u31=h<3, x) 

u32shC3,2>-u31aux2 

U23=ChC2,3^-1.21aux3>au22 

U33^1. /CHv3, 3>-ui3al31-u23au32> 

U242<H<2,4>-l21au14>AL2E 

U25=s<hC2,5>-l21au15)al22 

u4x=HC4, 1^ 

l42sh<4,E>-u4iaU12 

L43=H(4, 3>-l4xAUx3-L4£aU23 

U34a(H<3,4;-U3lAUl4-u32AU24>AU33 

L44aX,/CHC4,4^-Ux4AL4i-U24AU42-u34AL43> 

u35»<h<3, 5>-L3lAUi5-L32AU25> al33 

U51::H<  5 , 1 ^ 

L,5E  = HC5,£)-l51au12 


Figure  G.2  Implicit  Aero  - BTRI  (Cont'd) 
350  C
360  C 
370  C 
360 
390  C 
400  C 
4i0  C 

420  c 

430 

440 

450 

460 

470 

460  i£ 
490  C 
50  0 C 
5i0  C 
5E0 
530  C 
540  C 
550  C 
560 

570  14 
560 
590 
600  C 
610  C 
6E0  C 
630 
640 

650  11 
660 
670 
660 
690  C 
70  0 c 
710  C 
720 
730 
740 
750 
760 
770 


CPMPUTE  C PRIME  TOR  FIRST  RUW 
OD  i£  M s 1,5 

C HRS  BEEN  ELIMINRTED  RS  R SIMPLE 
RESUBSCRIPTXN6  OF  THE  D ARRRY 


B(1,5,M^  = 
B(i,4,M> 
B<1,3,M>  = 
B ( 1 , £ , M > * = 
B<1,1,M>  t5 
CONTINUE 


L55  ik  CdfSfiij 
L44  X CCI,4,M> 
l33  a 

lEE  X C < I , E , M ) 
Lxl  X C(1,1,M; 


HERE  NOW  STARTS  THE  MAIN  LOOP  OF  BTRI 
DO  13  I - E, lUA 


COMPUTE  6 PRIME  ft  BI6R 


OD  14  N»i,5 

rdfUj  = ra,H^  - 
i r<i~i,c> 
E FO-1,4^ 


A < I , H , 1 > X F C I - 1 , 1 ) **  A < I , H » 

- A(I,N,3>  X F<I-1,3>  - A<I, 

- AO,H,5>  X F<I-l,5; 


COMPUTE  B PRIME 


DO  11  N 
DO  11  M 
H<N, M; 


A, 5 

1,5 

B ( I , N , M J - 

A( I , N, E>  ft 
B(I-i,3,M^ 
A( I , N, 5>  ft 


Ai..  I,N,1^  A 

B< I-l, E, M^ 
- A<I,H,4^ 
BU-X,  3,  M J 


E t I - 1 , 1 , M J - 

« A <:  1 , N , C J X 

ft  B < I - 1 , 4 , M ; - 


iWSERT  LUDEC  ASAIN 


HERE  SHALL  BE  INSERTED  A COPY  OF  THE  FORMER  LUOEC, 

EXACTLY  AS  SHOWN  IN  THE  IMPLICIT  CODE  COMPILATION  BY  SCHAEFFER 


Figure G.2 Implicit Aero - BTRI (Cont'd)


Original  FORTRAN 


*5Xc00 
5i50C 
5A^00 
SiSOO 
5X600 
5X700 
5x600 
5X900 
5c000 
56X00 
5E200 
56300 
561400 
56500 
5660  0 
56700 
56300 
56900 
55000 
55X00 
55600 
55300 
53400 
53500 
53600 
53700 
53300 
53900 
54U00 
54x00 
54600 
54300 
544u0 
54500 
54600 
54700 
54S00 
54900 
55000 
55X00 
55600 
55300 
55400 
55500 
55600 
55700 


l55rH<5,3 

l54»H<5*4/-u5xacu4-l56aU64-x53au34 

uH5!=»Hi4,5;-u4XAUx5-L46KU65-u43Au35>AL44 

l55»x,/0h(5|5)-L5XaU*5-u56au65-1.53aU35-l54xu45j 

C COMPUTE  LITTUE  R S 

DX»Lj.XAr  ^ I , x^ 

D6«U6cAir(I«6^**L6j.Api) 
t»3»L33/(i  F < J ♦ » - L3XADl-t36AD6  ) 
D4»l44wiF(I,4j«l4XaDX-L4uaD6-L43ad3> 
U5s^L55A^,^^I,5^-u5XAOX*•L56AD6-L53»D3»•L54}ftD4i 
C COMPUTE  BIG  R’B 

r I , 5 >ud5 
F ^ I , 4 j ttD4-u45AD5 
Ft  I *3;:iDj-U34xF(  1,4  i-u35aD5 
Vi  I t 6 ^ nUc-U23AF ( 1 , 3 ; -u64 aF M , 4 ;-U65aD5 
r*  i,i>-DX~U*i;AFi  i,6/'-Uj.3AF«:  i.3j**U*4AF«.i,4^~ui5AC'5 
C COMPUTE  C FRItlES 

UQ  15  Mt:X,5 
DXsLXXAC  ^ 1 , X 4 IW 
d6-l66AiC. l464M;-u6XAOi) 
d5«l53a^C( I 4 3i M^~t3iAOX-L36A06  t 
D4cL4  4AiC(i4  4,h/-L4XADa-U46xDC'-L43A03j 
D5t:L55AvC'.'l45j(M^*-L5lADi-L56AD6-L53AD3»*L54AD4) 

EC i , 5, M J SD5 
B \ 1 ,4  ,rw-D4-u45AD5 
B^l434M^  i:  U5-u34aBc  I 4 4,  Mj -u35aD5 
B<I,6,MJ  s p6-u63aBi I , 3< M ^-u64aB(I 4 4, M j -Uu5ad5 
X5  SvI^XjM^  - OX‘-UX6aB  ( I 4 c 4 M ^ ~Ux5aB  < I 4 54  M; -UX4  AB  ( 1 4 4 4 M -UX5a05 
13  CONTINUE 

Is  lU 

C COMPUTE  B PRIMEABIG  R FDR  LfiST  RON 

DO  X7  NsX,5 

X7  F*'l4fOtsr(^l4NJ“Hi  IiNjiaJaF  >.  ItNjC^AFCl— Xj6y-fiCl»N43i?>*>" 

X F«.  I*•l♦3>-A^I,l^,4JAF<I-X,4^~rt^,:,N45)A^(  I-X,5> 

C COMPUTE  B PRIME 

00  16  H:=i,5 
DO  16  MeXtD 

13  HvN,  M jnBc  I , N,  H ^ I 4 H,  ^ > aB  < I -X,  X,  H J-fti  I 4 N,  6y  aB  ( I-X,  6,  H ;*-ft  k I , N,  3 .f  A 
xBi  I-X,  3,M  J-H^  I , N,4  jaBC  I-X4  4»  H I 4 N4  5^aB<  I-X4  M; 

C iNSERT  LUDEC  RGAIN 

L*Xsi,/HvxjX> 

L6X  = H (,  c 4 X / 

U^,6s:H<X,6>ALXX 
l66cx,/vH<64  6^~l6*aUX6> 

U*3»HcX,  3 f ALj.1 
Ux4tsH(X*  4 ;xLXX 


Figure G.2 Implicit Aero - BTRI (Cont'd)
780  C

790  c 

^00  c 

3X0 

3£0 

330 

340 

350 

360  C 

370  C 

380  C 

390 

900 

9X0 

930 

930 

940 

950 

960 

970 

930 

990 


COIIPUTi:  LJTTUE  R'S 

OX  S LXX  F < 1 , X ; 

uii2  A <,F(i,i;;  - Liii  A ox; 

03  « L33  a ^rO,3;  - L3X  a DX  - L3E  a DcO 

04  » L44  a iF(I,4;  - l.4i  a OX  - U4£  a Oi:  - U43  A 03; 

05  t:  u55a(F(X,5)  - u5XadX  - l5£aO£  - u55ad3  - l54ao4> 

COMPUTE  ons 


r O ) 5>  = d5 
F<I,4"j  » d4  -u45aD5 

F<I,3>  « 03  - U34Ar(l»4j  - U35 

F<I,ii>  ts  Dii  - U123aF<I«3;  - Ui^4 

F(I,X;  - OX  - Ux3aFU,c)  - UX3 

:F  (I  ,L7.  lUft)  THEN 
DO  X5  M = i,5 
OX  » UXXACOtXiMj 
Dii  s Lil£A<C(  I , C,  rU  **  taXAOX) 
03  3 l33a  C < 1 , 3,  M ^ - l5XadX 


u35aD5 

Ui^4AFO«4^  - ULSaDB 
Ux3ArCl,3j  • UX4AFfI,4) 


04  a L44a<C(I, 


l5XadX  - u3SAoe) 

l4xxDX  - L4£aD£  - u43ao3) 


xooo 

05  « L55aCC(I 

,5,M;  - L5 

XAOi 

- l5£aO£  - l53aD3 

- l54aD4j 

iOXO 

8 < I , 5 , M ^ 

05 

iOEO 

8 < 1 , 4 , H ; 

s 04 

- u45aD5 

1030 

8 ( 1 , 3 , M ; 

::  D3 

- U34aCiI 

,4,M; 

- u35ad5 

X040 

8 ( I , £ , M ; 

« 0£ 

- uii:3A0<:i 

♦ 3,M> 

- U£4aB(1,4,M^  - 

u£5 aD5 

X050 

8 I , X , M ; 

a OX 

ui£ao(i 

, £ , H ; 

- U13aB(I,3,MJ  - 

U 1 4 A B C 1 , 4 , M 

X060 

X 

- 

U15AD5 

X070 

15 

COr4TIHUE 

1080 

END  IF 

1090 

13 

CONTINUE 

iXOO 

C 

XXiO 

c 

THIS  IS  THE 

END 

OF  THE  MAIN  I 

LDDPi  INCLUDING  Is 

lUH 

Figure  G.2  Implicit  Aero  - BTRI  (Cont'd) 
55800

5S9oa 

*56000 

iSOlOO 

56200 

56300 

56400 

56500 

56600 

5670  0 

56600 

56900 

57000 

57i00 

57200 

57300 

57400 

57500 

57600 

57700  C 

57600 

5790  0 

56000 

56100 

58200 

56300  C 

56400 

58500 

56600 

56700 

56800 

58900 

59000  20 

59100 

59200  19 

59300 

59400 

59500 

59600 


Ul5tiH<i,  5>ALii 
L31sH^3, 1 ) 

L3c«HIv5,c>-l31au12 

3;*'Uc4,aui3j  22 

t33-l,/(H(3«3"j’*Ui3At31**U23>fU32,i 

U24«i:h(2, 4/«L2iAUl4;*u22 

u25ki‘hU  , 5 ;«UclAUi5  / al22 

L413H^4, 1 t 

L428HI 4»  c ;-l4iAU12 

L43sHk4,3>-L4iAUl3-L4cAU23 

u34»<h<,3, 4;-l3iau*4-u32aU24  ;au3S 

L443  1,  /t,H(;4, 4 /-yi4AL4l-u24«L42-u34AL45  > 

U35B^  3,5>-L3lAUi5~u3£AUi:5>AL33 

L5lBH^5tJl^ 

L5cbH<5,£>-l51aUo.2 

t53BH(5,3;-L5iAUl3'*L5EAu23 

L54«Hi  5, 4^  -u5iA  W 4‘-t5£AU24-t53AU34 

u45b<h<4,5;-L4iau^5-l42au25**l43au35>al44 

l55b1./vH^5, 5 ;“L5iau15~l5£aU£5-L55au35-L54au45 ) 

COMPUTE  LITTLE  R"S 
DlBLilAF  C 1,1^ 

D2sL2£ACr  < I t £ >-l21aD1) 

D3aL33ACr<  It  3;‘‘L3aaD1~l32ad£> 

D4ssL44Kt  r(  r , 4 ;-l4iaD1-L42ad2-L43A03> 

D5»L55A(r( I , 5)-L5iA0i-L5EADc-L53AD3-L54A04> 

COMPUTE  BIG  R’S 

Til, 5>b05 

F< I , 4 )bD4-U45a05 

F<  1 , 3^»03-L'34aF  ( 1 , 4;-u35aD5 

F<I,£^bo2~u23AF(I,3^-U£4aF(I,4  > “UE5AD5 

r<I,i>BDi-Ul£AF<:i,2>-Ul3AF(I,3>~Ui4AF<:i,4^-Ui5AD5 

I = IU 

iBl-i 

DO  19  Msl,5 

F ( I , H > BF  i I t N i ^F  I +1 , 1 > AB  U , N , X ^ “F  < i +1 , £ > AB  < I , N , £ > -F  ( I + 1 , 3 ; AB  < I , H , 3 
A J-FO+l?4;AB^r,N,4^-FO  + l,5>A8<I,N,5^ 
iF  < I , GT, I L ;GOTO20 
RETURN 
END 


Figure G.2 Implicit Aero - BTRI (Cont'd)
1120

c 

JIJL30 

c 

NOTE  THE  NESflTIVE  CODE  INCREMENTS 

xmo 

c 

JliSO 

DO  20  I s lUft-1,  1,  -1 

Xl&O 

DO  19  Nai,£> 

1X70 

19 

a F(I,H>  - rCl+i,i)ABO,N 

il30 

30 

CONTINUE 

1X90 

RETURN 

1300 

END 

THE  NEXT  SECTION 


Figure  G.2  Implicit  Aero  - BTRI  (Cont'd) 
      SUBROUTINE OUTER(JS,JE,KS,KE)
C        ** OUTER BOUNDARY CONDITIONS **
CC
CCALL A1
CCALL A3
CCALL A4
      COMMON/A11/ RHO(31,31,31),RHOU(31,31,31),RHOV(31,31,31)
      COMMON/A12/ RHOW(31,31,31),E(31,31,31),EI(31,31,31)
      COMMON/A13/ U(31,31,31),V(31,31,31),W(31,31,31)
      COMMON/A3/ Y(31),DYCELL(31),JS1,JE1,JS2,JE2,JLFM,JL,YF,YH
     1 ,Z(31),DZCELL(31),KS1,KE1,KS2,KE2,KLFM,KL,ZF,ZH
      COMMON/A4/ ISHK,ILE,IE,IL,K1,K2,K3,K4,K5
C***     DOWNSTREAM AT I=IL
      DO 1 K=KS,KE
      DO 1 J=JS,JE
      RHO (IL,J,K)=RHO (IE,J,K)
      RHOU(IL,J,K)=RHOU(IE,J,K)
      RHOV(IL,J,K)=RHOV(IE,J,K)
      RHOW(IL,J,K)=RHOW(IE,J,K)
      E   (IL,J,K)=E   (IE,J,K)
    1 CONTINUE
      IF(JE.LT.JE2) GO TO 3
C***     UPPER B.C. AT J=JL
      DO 2 K=KS,KE
      DO 2 I=2,IE
      RHO (I,JL,K)=RHO (I,JE2,K)
      RHOU(I,JL,K)=RHOU(I,JE2,K)
      RHOV(I,JL,K)=RHOV(I,JE2,K)
      RHOW(I,JL,K)=RHOW(I,JE2,K)
      E   (I,JL,K)=E   (I,JE2,K)
    2 CONTINUE
    3 CONTINUE
      IF(KE.LT.KE2) RETURN
C***     EDGE B.C. AT K=KL
      DO 4 J=JS,JE
      DO 4 I=2,IE
      RHO (I,J,KL)=RHO (I,J,KE2)
      RHOU(I,J,KL)=RHOU(I,J,KE2)
      RHOV(I,J,KL)=RHOV(I,J,KE2)
      RHOW(I,J,KL)=RHOW(I,J,KE2)
      E   (I,J,KL)=E   (I,J,KE2)
    4 CONTINUE
      RETURN
      END

Figure G.3 Explicit Aero - OUTER
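Each block of the original OUTER listing copies the flow variables from the last computed plane onto the adjacent boundary plane. The sketch below illustrates the downstream case for a single variable in Python, using nested lists and zero-based indices; it is a simplified illustration rather than a translation of the routine, and the name `apply_downstream_bc` and its argument layout are assumptions made for this example.

```python
def apply_downstream_bc(field, il, ie, js, je, ks, ke):
    """Copy plane I=ie onto boundary plane I=il for J in js..je and
    K in ks..ke (inclusive), as in the DO 1 loop of the listing.

    field is indexed field[i][j][k] with zero-based indices.
    """
    for k in range(ks, ke + 1):
        for j in range(js, je + 1):
            field[il][j][k] = field[ie][j][k]
    return field


# Small 4 x 3 x 3 test field whose values encode their own indices.
rho = [[[100 * i + 10 * j + k for k in range(3)] for j in range(3)]
       for i in range(4)]
apply_downstream_bc(rho, il=3, ie=2, js=0, je=2, ks=0, ke=2)
```

In the listing the same copy is applied to RHO, RHOU, RHOV, RHOW, and E, and analogous copies are made at the upper (J=JL) and edge (K=KL) boundaries.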
      SUBROUTINE OUTER(JS,JE,KS,KE)
      COMMON/A11/ RHO(100,100,100),RHOU(100,100,100),RHOV(100,100,100)
      COMMON/A12/ RHOW(100,100,100),E(100,100,100),EI(100,100,100)
      COMMON/A13/ U(100,100,100),V(100,100,100),W(100,100,100)
      COMMON/A3/ Y(100),DYCELL(100),JS1,JE1,JS2,JE2,JLFM,JL,YF,YH
     1 ,Z(100),DZCELL(100),KS1,KE1,KS2,KE2,KLFM,KL,ZF,ZH
      COMMON/A4/ ISHK,ILE,IE,IL,K1,K2,K3,K4,K5
C     DOWNSTREAM AT I=IL
      DOALL K=KS,KE; J=JS,JE ; USING /A11/,/A12/,/A3/,/A4/
      RHO(IL,J,K)  = RHO(IE,J,K)
      RHOU(IL,J,K) = RHOU(IE,J,K)
      RHOV(IL,J,K) = RHOV(IE,J,K)
      RHOW(IL,J,K) = RHOW(IE,J,K)
      E(IL,J,K)    = E(IE,J,K)
      ENDDO ; GIVING /A11/,/A12/
      IF (JE.LT.JE2) GO TO 3
C     UPPER B.C. AT J=JL
      DOALL K=KS,KE; I=2,IE ; USING /A11/,/A12/,/A3/,/A4/
      RHO(I,JK,K)  = RHO(I,JE2,K)
      RHOU(I,JK,K) = RHOU(I,JE2,K)
      RHOV(I,JK,K) = RHOV(I,JE2,K)
      RHOW(I,JK,K) = RHOW(I,JE2,K)
      E(I,JK,K)    = E(I,JE2,K)
      ENDDO; GIVING /A11/,/A12/
    3 IF (K.GE.KE2) THEN
C     EDGE B.C. AT K=KL
      DOALL J=JS,JE; I=2,IE ; USING
      RHO(I,J,KL)  = RHO(I,J,KE2)
      RHOU(I,J,KL) = RHOU(I,J,KE2)
      RHOV(I,J,KL) = RHOV(I,J,KE2)
      RHOW(I,J,KL) = RHOW(I,J,KE2)
      E(I,J,KL)    = E(I,J,KE2)
      ENDDO; GIVING /A11/,/A12/
      ENDIF
      RETURN
      END

Figure G.3 Explicit Aero - OUTER (Cont'd)
      SUBROUTINE TURBDA
CCALL A1
CCALL A3
CCALL A4
CCALL A5
CCALL A6
      COMMON/A11/ RHO(31,31,31),RHOU(31,31,31),RHOV(31,31,31)
      COMMON/A12/ RHOW(31,31,31),E(31,31,31),EI(31,31,31)
      COMMON/A13/ U(31,31,31),V(31,31,31),W(31,31,31)
      COMMON/A14/ F(2,5)
      COMMON/A2/ PRDICT(32,5),P(32)
      COMMON/A3/ Y(31),DYCELL(31),JS1,JE1,JS2,JE2,JLFM,JL,YF,YH
     1 ,Z(31),DZCELL(31),KS1,KE1,KS2,KE2,KLFM,KL,ZF,ZH
      COMMON/A4/ ISHK,ILE,IE,IL,K1,K2,K3,K4,K5
      COMMON/A5/ GAMMA,GAMM1,GAMMPR,CV,CV1,STOKES,U0,C0,P0,RHO0,RL,X0
      COMMON/A6/ RMUL(31,31,31)
      CV1=1./CV
      DO 1 K=1,KL
      DO 1 J=1,JL
      DO 1 I=1,IL
      TEMP=ABS(EI(I,J,K))*CV1
      IF(K.EQ.1) TEMP=.5*ABS(EI(I,J,1)+EI(I,J,2))*CV1
      IF(J.EQ.1) TEMP=.5*ABS(EI(I,1,K)+EI(I,2,K))*CV1
      RMUL(I,J,K)=(2.270E-08*SQRT(TEMP**3))/(TEMP+198.6)
    1 CONTINUE
      RETURN
      END

Figure G.4 Explicit Aero - TURBDA
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iOO 

      SUBROUTINE TURBOR(CV)
      COMMON/A11/RHO(100,100,100),RHOU(100,100,100),
     1 RHOV(100,100,100)
      COMMON/A12/ RHOW(100,100,100),E(100,100,100),EI(100,100,100)
      COMMON/A13/ U(100,100,100),V(100,100,100),W(100,100,100)
      COMMON/A14/ F(2,5)
      COMMON/A2/ PRDICT(101,5),P(101)
      COMMON/A3/ Y(100),DYCELL(100),JS1,JE1,JS2,JE2,JLFM,JL,YF,YH
     1 ,Z(100),DZCELL(100),KS1,KE1,KS2,KE2,KLFM,KL,ZF,ZH
      COMMON/A4/ ISHK,ILE,IE,IL,K1,K2,K3,K4,K5
      COMMON/A5/GAMMA,GAMM1,GAMMPR,CV,CV1,STOKES,UO,CO,PO,RHOO,RL,XO
      COMMON/A6/ RMUL(100,100,100)
      DOMAIN /EXPLCT/: I=1,100; J=1,100; K=1,100
      INALL/EXPLCT/ TEMP
      CV1 = 1.0/CV
      DOALL J=JS1,JE2; K=KS1,KE2; USING /A12/,/A5/
      DO 1 I=1,IL
      IF (K.EQ.1) TEMP=0.5*ABS(EI(I,J,1)+EI(I,J,2))*CV1
      ELSE IF(J.EQ.1)TEMP=0.5*ABS(EI(I,1,K)+EI(I,2,K))*CV1
      ELSE TEMP=ABS(EI(I,J,K))*CV1
      ENDIF
      RMUL(I,J,K) = 2.270E-08*SQRT(TEMP**3)/(TEMP+198.6)
    1 CONTINUE
      ENDDO; GIVING /A6/
      RETURN
      END


Figure G.4

      SUBROUTINE LX
C     LX OPERATOR
CCALL A1
CCALL A2
CCALL A3
CCALL A4
CCALL A5
CCALL A7
      COMMON/A11/ RHO(31,31,31),RHOU(31,31,31),RHOV(31,31,31)
      COMMON/A12/ RHOW(31,31,31),E(31,31,31),EI(31,31,31)
      COMMON/A13/ U(31,31,31),V(31,31,31),W(31,31,31)
      COMMON/A14/ F(2,5)
      COMMON/A2/ PRDICT(32,5),P(32)
      COMMON/A3/ Y(31),DYCELL(31),JS1,JE1,JS2,JE2,JLFM,JL,YF,YH
     * ,Z(31),DZCELL(31),KS1,KE1,KS2,KE2,KLFM,KL,ZF,ZH
      COMMON/A4/ ISHK,ILE,IE,IL,K1,K2,K3,K4,K5
      COMMON/A5/ GAMMA,GAMM1,GAMMPR,CV,CV1,STOKES,UO,CO,PO,RHOO,RL,XO
      COMMON/A7/ DX,DX1,DY,DY1,DZ,DZ1,EIWALL,IADBWL,DT,CFL,CONST
      DTDX=DT*DX1
      DO 1 K=KS1,KE2
      DO 2 J=JS1,JE2
      DO 3 I=1,IL
      PRDICT(I,1)=RHO (I,J,K)
      PRDICT(I,2)=RHOU(I,J,K)
      PRDICT(I,3)=RHOV(I,J,K)
      PRDICT(I,4)=RHOW(I,J,K)
      PRDICT(I,5)=E   (I,J,K)
      P(I)=GAMM1*RHO(I,J,K)*EI(I,J,K)
    3 CONTINUE
      DO 4 N=1,2
      I=1
      IADD=N-1
      NM1=N-1
      B=1./N
      II=I+IADD
      UII=U(II,J,K)
      CALL FX(UII,I,J,K,II)
      DO 5 I=2,IE
      K3=K1 ; K1=K2 ; K2=K3
      II=I+IADD


Figure G.5  Explicit Aero - LX
      SUBROUTINE LX
C     LX OPERATOR
      COMMON/A11/ RHO(100,100,100),RHOU(100,100,100),RHOV(100,100,100)
      COMMON/A12/ RHOW(100,100,100),E(100,100,100),EI(100,100,100)
      COMMON/A13/ U(100,100,100),V(100,100,100),W(100,100,100)
      COMMON/A14/ F(2,5)
      COMMON/A2/ PRDICT(101,5),P(101)
      COMMON/A3/ Y(100),DYCELL(100),JS1,JE1,JS2,JE2,JLFM,JL,YF,YH
     1 ,Z(100),DZCELL(100),KS1,KE1,KS2,KE2,KLFM,KL,ZF,ZH
      COMMON/A4/ ISHK,ILE,IE,IL,K1,K2,K3,K4,K5
      COMMON/A5/ GAMMA,GAMM1,GAMMPR,CV,CV1,STOKES,UO,CO,PO,RHOO,RL,XO
      COMMON/A7/ DX,DX1,DY,DY1,DZ,DZ1,EIWALL,IADBWL,DT,CFL,CONST
      DOMAIN /EXPLCT/: I=1,100; J=1,100; K=1,100
      DTDX=DT*DX1
      DOALL J=JS1,JE2; K=KS1,KE2; USING /A11/,/A12/,/A13/,/A14/,/A2/,/A4/
      DO 3 I=1,IL
      PRDICT(I,1)=RHO (I,J,K)
      PRDICT(I,2)=RHOU(I,J,K)
      PRDICT(I,3)=RHOV(I,J,K)
      PRDICT(I,4)=RHOW(I,J,K)
      PRDICT(I,5)=E   (I,J,K)
      P(I)=GAMM1*RHO(I,J,K)*EI(I,J,K)
    3 CONTINUE
      DO 4 N=1,2
      I=1
      IADD=N-1
      NM1=N-1
      B=1./N
      II=I+IADD
      UII=U(II,J,K)
      CALL FX(UII,I,J,K,II)
      DO 5 I=2,IE
      K3 = K1
      K1 = K2
      K2 = K3
      II=I+IADD

      UII=U(II,J,K)
      UI1=U(I+1,J,K)
      UI2=U(I,J,K)
      IF(UI1.GT.UI2.AND.(3.*UI1-UI2)*(3.*UI2-UI1).LT.0.) UII=.5*(UI1+UI2
     X )
      CALL FX(UII,I,J,K,II)
      PRDICT(I,1)=(NM1*PRDICT(I,1)+RHO (I,J,K)-DTDX*(F(K2,1)-F(K1,1)))*B
      PRDICT(I,2)=(NM1*PRDICT(I,2)+RHOU(I,J,K)-DTDX*(F(K2,2)-F(K1,2)))*B
      PRDICT(I,3)=(NM1*PRDICT(I,3)+RHOV(I,J,K)-DTDX*(F(K2,3)-F(K1,3)))*B
      PRDICT(I,4)=(NM1*PRDICT(I,4)+RHOW(I,J,K)-DTDX*(F(K2,4)-F(K1,4)))*B
      PRDICT(I,5)=(NM1*PRDICT(I,5)+E   (I,J,K)-DTDX*(F(K2,5)-F(K1,5)))*B
    5 CONTINUE
C
C * DECODE *
      DO 6 I=2,IE
      RHOI=1./PRDICT(I,1)
      U (I,J,K)=PRDICT(I,2)*RHOI
      V (I,J,K)=PRDICT(I,3)*RHOI
      W (I,J,K)=PRDICT(I,4)*RHOI
      EI (I,J,K)=PRDICT(I,5)* RHOI -.5*(U(I,J,K)**2+V(I,J,K)**2+W(I
     X ,J,K)**2)
      P(I) =GAMM1*PRDICT(I,1)*EI(I,J,K)
    6 CONTINUE
C*** *DOWNSTREAM B.C. AT I=IL
      DO 9 K6=1,5
    9 PRDICT(IL,K6)=PRDICT(IE,K6)
      CALL BCY(K,2,IE,J,J)
    4 CONTINUE
      DO 7 I=2,IL
      RHO (I,J,K)=PRDICT(I,1)
      RHOU(I,J,K)=PRDICT(I,2)
      RHOV(I,J,K)=PRDICT(I,3)
      RHOW(I,J,K)=PRDICT(I,4)
      E   (I,J,K)=PRDICT(I,5)
    7 CONTINUE
    2 CONTINUE
    1 CONTINUE
      CALL OUTER(JS1,JE2,KS1,KE2)
      RETURN
      END

      UII=U(II,J,K)
      UI1 = U(I+1,J,K)
      UI2=U(I,J,K)
      IF(UI1.GT.UI2.AND.(3.*UI1-UI2)*(3.*UI2-UI1).LT.0.) UII=.5*(UI1+UI2
     X )
      CALL FX(UII,I,J,K,II)
      PRDICT(I,1)=(NM1*PRDICT(I,1)+RHO (I,J,K)-DTDX*(F(K2,1)-F(K1,1)))*B
      PRDICT(I,2)=(NM1*PRDICT(I,2)+RHOU(I,J,K)-DTDX*(F(K2,2)-F(K1,2)))*B
      PRDICT(I,3)=(NM1*PRDICT(I,3)+RHOV(I,J,K)-DTDX*(F(K2,3)-F(K1,3)))*B
      PRDICT(I,4)=(NM1*PRDICT(I,4)+RHOW(I,J,K)-DTDX*(F(K2,4)-F(K1,4)))*B
      PRDICT(I,5)=(NM1*PRDICT(I,5)+E   (I,J,K)-DTDX*(F(K2,5)-F(K1,5)))*B
    5 CONTINUE
C
C * DECODE *
      DO 6 I=2,IE
      RHOI=1./PRDICT(I,1)
      U (I,J,K)=PRDICT(I,2)*RHOI
      V (I,J,K)=PRDICT(I,3)*RHOI
      W (I,J,K)=PRDICT(I,4)*RHOI
      EI (I,J,K)=PRDICT(I,5)* RHOI - .5*(U(I,J,K)**2+V(I,J,K)**2+W(I
     X ,J,K)**2)
      P(I) =GAMM1*PRDICT(I,1)*EI(I,J,K)
    6 CONTINUE
C*** *DOWNSTREAM B.C. AT I=IL
      DO 9 K6=1,5
    9 PRDICT(IL,K6)=PRDICT(IE,K6)
      CALL BCY(K,2,IE,J,J)
    4 CONTINUE
      DO 7 I=2,IL
      RHO (I,J,K)=PRDICT(I,1)
      RHOU(I,J,K)=PRDICT(I,2)
      RHOV(I,J,K)=PRDICT(I,3)
      RHOW(I,J,K)=PRDICT(I,4)
      E   (I,J,K)=PRDICT(I,5)
    7 CONTINUE
      ENDDO; GIVING /A11/,/A12/,/A13/,
      CALL OUTER(JS1,JE2,KS1,KE2)
      RETURN
      END

89900 

90000 

90i00 

90200 

90300 

90400 

90500 

90600 

90700 

90300 

90900 

91000 

91X00 

91200 

91300 

91400 

91500 

91600 

91700 


KflPftPXsKflPft+X. 

caRxauis  tdrce 

CkXAA 

FXCOS,  l255*tOT 
DO  3X30  L=l,NLftY 
I Mx=:  IM 

DO  3130  t^Xf IM 
f‘D(l,  I>=U. 
rO<  JM,  ; >:=0  , 

DO  31X0  >J«2,JMMi 

3110  FO<J,  i >:sF<  ADXYP<  J/  + , 253«C<UC  J,  :,U;*rUvJ,  I MX  , L ^ *f  U C JtI  , X , L ^ ♦ 
X U(J*ri,IMl,L^  ;wkOXU<  J;-DXU(  JtX)  ^ 

DO  3120  Js2,JM 

flLPHsFXC0)«c(p<  j,  X ;tP(  j-x,  i;;acfd<j,i;+fd<j-i,i^> 

UT<  J,  X , U^SUT  C J,  i , L^*rfttPH)»CV<  J , * , L> 

UT(J,XMx,L;«UT<J,  I Ml , L ^ tPLPHxM ( J , IMX, 

, L^-ftUPHxU< J, I , U; 

3*20  yT<J,XMx,L>sg7(J,  iMX, L;-ALPHXUv J,  X Mi, l; 

3x30  IMX2I 
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REPRODUCIBILfl'^'  OF 
OBIGINAL  PAGE  IS  Pn<  ,h: 

90000  C 

90100  C THIS IS THE SECTION OF GISS, COMP3 THAT WAS SIMULATED

90E0  0 C 
100000  C 

iOOiOO  C CORIOLIS  FORCE 

100200  C 

100300  DDALL  J-E, UMMi; 1=1, IM 

100400  IF  ( I . EG, 1>  THEN 

i00500  :mi=im 

:.00600  ELSE 

a.00700  IMI  s I 

100600  EMOIF 

100900  FD(1,  1)  s 0,0 

101000  FO<UM,IJ  0,0 

iOilOO  DD  100  LsljNLAY 

iOlEOO  C 

101300  C HERE  THE  COMMON  SUBSCRIPT  EXPRESSIONS  ARE  NOT  GIVEN 

101400  C BUT  THE  COMPILER  IS  ASSUMED  TO  HAVE  EXTRACTED  THEM  APPROPR 1 ATEL Y . 

101500  C 

102200  ro<J,I>  » F<J;  T OXYP<J>  r 0 , 25a<U < J, I , L >tU< J, I-i, L;tU<U+1 , I , L ^ 

102300  1 + U<  Jf  I,  l-l,L>  )X<OXU^.vl>-OXU^  J-rl)  ) 

102400  100  CONTINUE 
102500  ENDOO 

102600  OOALL  JM; I=I , XM 

102700  IF  <1,EG,1)  THEN 

102800  XMl  r;  XM 

102900  ELSE 

103000  IMI  IS  1 

103100  EHDIF 

103200  DD  200  L K 1,HLAY 

103700  rXCD  - 0,125AOJ 

103800  ALPH  « FXCO  <p(  J,  I )+P(  J-i,  I ) )A<FD<  J,  I ;AFD<  J-i>  I ) ) 

103900  UT(J,X,L;P  « UT<U,I,L^  t ALPHR  V J • I , L / 

104000  UT(J,IMl,L,i  a UT<J,XMi,L>  r ALPHRV  < J , I Ml  , L / 

104100  VT(J,I,L;  - VT<J,I,L;  - RLPHXUC J, I , L/ 

104200  VT<U,IM1,L^  K VT<I,IM1,L;  - ALPHRU V J , I Mi , L ^ 

104300  200  CONTINUE 
104400  ENDDO 
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9X^00 

9X900 

9^000 

9^i00 

9££00 

92300 

92H00 

92500 

92600 

92700 

92800 

92900 

93000 

93X00 

93200 

93300 

93^00 

93500 

93600 

93700 

93800 


C****

C**** VERTICAL ADVECTION OF THERMODYNAMIC ENERGY

C****

DO  3180  L«l,NLAYMi 

LRisLTl 

DO  3180  I=:Jl,  IM 
DO  3180  JuXfJtl 
RLinPTR0P  + 5r6CL)AP>CJ<  I > 

PL2nPTR0P+S:GCLFl>APf J, i) 

PKisiEXPBYK<PUi  ; 

PK£sEXPDyK^PL£> 

COisOSIGCUPl )/ ^D£I G<L)tDSIG(UP1 ) ) 

CQ2si,-.CQi 

TETAMnCOlAT ( J, *,L^/PKAtC02aT<J, I,LP1)/PK£ 

TT(J,i,L;iiTT<J,4,L^fDTA^£IG<L  # AKftPftAPC  J, * )AT(J,I,L;APIT<J,I^/PLi 
3-K  J,  I , U > ATETAMAPkl/0£lG(U  M 

7T(J,  I,LPi;isTTvJ,  i4LFX;«fDTASD(J,X,U^ATETftMAPK2/DSt6(Lj 
ir<LPi.  EG,  NLAY  ^ TT(  J,  X,  LPi)siTT(J,  I , LPi  ; +OTAS  I 6 < LPl  > AKftPAAP  < J , I >A 
A T< J,  I ,LP1  )APIT( J,  I )/PL2 
3180  CONTINUE 
C AAAA 
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i0^500  C 
J.04600  C 
X04700  C 
X04800 
104900 
108100 
105E00 
10S300 
105400 
105500 
105600 
105700 
105800 
105900 
106000 
106100 
106S00 
106300 
106400 
10650C 
10  66011 
106700 
106800 
106900 
107000 
107100 
107800  300 
107300 
107400  C 
107500  C 
107600  C 


VERTICAL ADVECTION OF THERMODYNAMIC ENERGY

DDALL  Xsl,  2 M 

DO  300  LslfNLHYMl 
LPA  » P( J, I ) 

LSI6A  » Sl<a(L> 

LSIGB  2 SI6(Lfl) 

PKl  c EXPBYK<PTRDP  + LS*1GAALPA> 

PK2  s EXPBYK(PTRDP  + LSIGBALPA) 

LDS16A  n 0£1G(L^ 

LDSI6B  & DSIG(LTl) 

LTft  s T( J,  X , L > 

LTB  s T( J, X , LtI^ 

LSOR  s SniJfl fL; 

UPITft  a PIT(vJ,  2 > 

COl  a LDSIGB/^LDSZGA<I>LPSXGB) 

C02  a 1,0  - COi 

TETAM  a COlALTA/PKl  + C08RLTB/PK£ 

U7TA  a TT<J,*,L^  t DY A < US I 6AAKAPRRLP A ALTAAUP I TA/PLi- 
1 USDAATETAMXfPKl/LDSIGA) 

LTTB  aTT<J,K,Lfi^  + DT AL SO AA T E T AMRPKE /LDS 1 6 A 
IF  <LP1, EG, HLAY;  LTTBaLTTB  + OT AUS I 6B AKAPAALPAALTB* 

1 LPITA/PLE 

TT(J,I,L^  a LTTA 
7T(J,I,Lt1)  a L7TB 
CONTINUE 
ENODO 

COMP3 CONTINUES BEYOND HERE. THIS IS THE END OF THE PIECE SIMULATED
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u 3 5 c i)  0 
w3&30  0 

a35500 
u 3 5 6 0 0 
a35?(i0 
L3  360  Cl 
C35900 

aseooo 

a36c00 
u 3630  0 
3364 U 0 
336500 
336600 
33670CI 

u .>66  0 0 

336900 

337000 

337100 

357300 

337300 

337400 

c37500 

£37600 

£37700 

£37300 

£37900 

£33000 

£33100 

£33200 

£35300 

£35400 

£35500 

£35600 

£35700 

£35800 

£85900 

E36000 

£36100 

£86c00 

£86400 

£36500 

£36600 


^MBROUYIia’  LiMKHO 

C A A‘  *:  A A 

CjvAAAAlNTERFftCe 

CUMMmi/”RftDCOM/PL  t 9 J ♦ I-  Lt  i lO  J , f LK\  V . , T«5, 7:?,  7L  ^ 9 ^ , Y ^7K  i 3 J ♦ ZHL  i 9 J , 

V C L U U 0 \ i £ J , h E I 11*  ^ , K E S T P < 3 > ♦ r L L<  M 4 , i:  G « H £ ..  9 M H iJ  S T P.  f 3 > , i 0 ♦ C 0 5 2 I R S U R r 
A,iC0£2,RRR,HHt'1 

CmiMON/CLDCUM/£IJRLt  t it-  • , £Wi  L \ i 5 J , hL  ^ x 6 J « Y ft  U L i iO  ^ , 02flLE  i,  i 6 / ♦ TOP  AB 

C A A A A A 

CaxaaAGRIO  ARRftVS  • STORAGE  FPOBLEM  OU  £TAft^ 

LUGiCAL  CLDFUG,  hEKFLG,  Li.  « L£ 

iNYEGER  iTY 

PEAL  A*AAA,£EtfcE,Ulli.,TAUA,TAUY,AA,EB,CC,THSG*TNt 
A 7AU«  EOrjCN,  YC'FCN  . KDNCn,  YAUC  IR  , F i U » EXTAU,  7 V , AEFl  t AERc  , AERA» 

AERU,  AERM.  HERC*  EXl , E:.£,  C'CHU,  DHMD,  DMrti.*  REFUP,  FEFDr4 
D ZMENS  i DN  UNHtiO  ^ i£  > , UIJLOE  < Ic  J » UN03  i i£  J , F <:  l£  J ♦ E i i£  , NCLUUD  ^ 1£  ^ , 

A T E i 1 4 ; , fc  T D P ( 14  I , r E i X 3 ‘ , Y A U rj  I 1 £ , 3 ^ , F K Q A 5 C 3 ) , F F“  G A S £ ( 3 ^ • 

A EUPi  ic  » EDIU  x£  > t EUF  C ( ic  > » EDNC  l£  > , Tt'F  i i£  ; , TDFCi  1£  > , 

A REFC  1£  ; , PDNC(.-.£^ 

EOUiMALEHCE  (FKGflSi  1 > , EUP«;  1 ^ . FF  GASE^i  ) , EUP^.4  / 

EOUiVALENCE  i.A,EDHCNJtiAAA^TDFCI4  i»  v£E,PO(ICNJ, 

A t EE,  TAUC  XP  ) » 1 UNX, F iU  f ^ v Y A UX , E X V AU  ) , v TAUT*  YY  ) » 

A ^.AA<  ITVJi«  I AERX4&BP,  tAER£,CCj.»  , vY44£0?,AERAJ 

C AA  AA A 

CaaAAa£CALAR  ARRAVs  ^tables  dr  used  FDR  I N I T I AU  I 2 A T I ON 

DIMENSION  CB(X£,x£<j,FX2<l£,i£j,TAti£,i£;,FFi(i£),FF£(lt;* 

A YEMP  ( £3  > , TE3  C 30  1 ; » OU  vii;,FIAERD(x£,il/«  NRERD  < X£  > , 

A ACOSBR<  i£,  £>  * AEREXTi  Xc,  £ 4 * HTAU55i.  £ / » P XCIRQt  X£  / , 

A CIREATvl£;,CCO£BR(lc),COCLAHtXcj,COEKl3^ 

DIMENSION  SH£0  < 3,3),  BH£d  <3 ,3>,NKi5,3j»A1C1£»3j,a£v1£,3j,a3CX£,3.- 

AA4'(x£43"j»tj.(.3t3,£j%B£v3,3»£^»B3i’-,iJ',£;',CXik£«3),Cci.c<|3vtNK£0t£,Lj' 
COMMON  /EINT/  TEJ 
DO  XOi  N^XjNLrt'iPS 

C AA  AA 

CaAAA  single  LAVER  COMPUTATION 
caaaa  eupcupwards  emission 
CaAAA  EDNsDDUNWAROS  emission 
CAAAA  TDFsTRANSMI SS ION 
CAAAA  KEFcRE  _ECTION 
C AA  AA 

NCCtiNCLPUD  < N ; 
r4rtER«NAERD<N; 

TAUCIRsiNCLOUD^NJAtC*REXT<LAU*ACTAU55) 

7AUH<,r4,FOaTAU»4^N»K*i“rAUCIR 
■.XxTRUUkN,  K ; 
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iOOOOO  subroutine:  UiNKHO 

xOOiOO  CDIIMOH  /RADCQM/PLv9>,PLE^iU>.FLK^,9J,TS,T£,TL^9>,TSTRC3J 

.^00^00  i , , CLOUD  i£  > , RE  (JIO  >,RESTR<3^  . FL ON  S , S 6 , rtS  ( 9 ^ , RSSTR  ( 3 > , 

100300  E SC,COS2,RSURF, £COSZ,RftPiRfiH 

100400  COMMON  /CLOCOM/  SWftL  E C i 6 > , SW  I Ul  15  > , RL  C 1 6 j , TftUL  1 6 ; . D2RLE  < 1 6 ) , 

100500  1 VDPRBS 

100600  LOGICAL  CLDFLG , RERFLG , Li , Lc 

100700  REAL  TAUCIR,CTAU55,X,PID,TN,AERl,fiER£,AERfljAERC,AER<»,AERU, 

100600  1 EXl, EX£,UEN0 ,DNM0 , DNM I , RERU , EX7RU , TAU , RDMCN , E DHCN, TDFCN, 

100900  ^ EUPCN, EONCH 

101000  INTEGER  HCLOUD(  12)  , WAERDt.l£> 

iOilOO  REAL  CIREXT(l2),TAUN(12,3),PICIR<i£),PI2<l£,l£>jC6fi£,l2>, 

101£0  0 1 ETDP<m)rfOF(l£),  REFU£>»  EUP<i£>,EDN<i£),  TE3(30i),EUPC(l£), 

101300  £ E0NC(1£  J , TDFC<1£)  , RDNC(1£'; 

101400  C
101500  C ADDITIONAL DECLARATIONS NOT USED IN THE SIMULATED PORTION
101600  C ARE OMITTED FOR BREVITY
101700  C
101800  C STATEMENTS ABOUT PARALLELISM ARE OMITTED ALSO SINCE LINKHO
101900  C IS CALLED AS A SUBROUTINE WITHIN THE INSTANCES OF THE
102000  C DOALL /LAYERS/ OF COMP3. IN THIS CASE, EACH INSTANCE
102100  C CALLS LINKHO INDEPENDENT FROM ALL OTHER INSTANCES AND
102200  C USES A LOCAL COPY OF CODE WITHIN THE PROCESSOR IN WHICH
102300  C THE INSTANCE RESIDES. SEQUENCING OF THE EXECUTION WITHIN
102400  C THIS SUBROUTINE IS SOLELY DEPENDENT ON THE INSTANCE AND
102500  C LOCAL DATA, NOT ON ANY OTHER INSTANCES.
102600  C

102700  DO 200 LAM = 1,12
102800  DO 100 K = 1,3
102900  DO 101 N = 1,NLAYRS
103000  NCC = NCLOUD(N)
103100  NAER = NAERO(N)
103200  TAUCIR = CIREXT(LAM) * CTAU55 * NCC
103300  X = TAUN(N,K) + TAUCIR
103400  TAUN(N,K) = X
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£86700 

£86600 

£86900 

£87000 

£87100 

£87600 

£87700 

£87300 

£83300 

£88350 

£83700 

£83300 

£89100 

£391^00 

£89500 

£89600 

£89700 

£89800 

£90600 

£90700 

£90800 

£90900 

£91000 

£91100 

£91£00 

£91500 

£9E100 

£9£400 

E9£700 

£93000 

£93100 

£93800 

£93500 

E93HU0 

£93500 

£93600 

£93700 

£93300 

£93900 

£9^000 


P IO«PIZERD<N, K } 

CX5<<A0<5A  AyicyicA  A AA>kK)«(3«*AKAAA«5«(}lC>ICJ«CA)«C3lC)«CA*A:«CllcycA*A*AA/t3*C«y(:«C*J«C3«CXA)«C:«<«A:«CA3«C3«C)«tA3«C)«C**J«t** 

C IN  CASE  CODE  DIDN'T  60  THROUGH  PROPERTIES  CftLCUUfiTIDN 
C SET  FID  TO  ZERO 

caaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 

:r<H,LE,3;  TNsTSTRi  N ;/£73, 
thstlvN-3;/£73. 

iP ( TN , 6E. , 85348, RND, HCLDUDvN> , 6T.0>FIU=U, 

I F < F I 0 , 6T , 1 , E-O 4 ^ GO  70  160 
XF  O^CC,67,U;  60  70  10£ 

CAAAAUPWftRD  FIND  DDHNWHRD  FL*UX'EDN<N^  OF  S I NGL  E < E UP  ( N ; , E DN  N ^ ; FIND  COMPOS 
jlEE  IFI^X,  L7,  i,  E-04>  go  TO  i03 
XF  GT  , i5,  EU  > GO  TO  104 
Xis-TfiUtKN,  K/ 

EX7fiU=EXP(XX ) 

CXAAAACLEftR  LfirER-“PIU,LT.l.E-4 
7 Vs£0 , EOxX 
X TY:sT  V + 1 , EO 

7DF<fNi=TE3  0TY/  + ^TY-iTY  + l>A(  TE3C  I T Y + X ) - TE3  < I T Y ^ > 

GO  TO  105 

104  CONTINUE 
EXTftUKU , 

VDF  <N  ,Js0  , 

105  REF(Ny«0 , 

OresvBTDPUJ>“*sTOPvN  + li>A6,6667E**Oi 
FGRftDsDFBAvd,  0 *•  E X T ft  U j /X- T OF  ^ N ^ ) 
ftNS=X,0-TDF( r 

EDNl  N ^sBTDP<  N*rx  j AftNS  + FGRflD 
EUP(H>:=BTDP<NjAriHS-FGRftD 
60  70  109 
xC3  7DF(N>“1,0 
KEFC N j=0 . 

EUPkN  > = 0 . EO 
EDN<N  , EU 
GO  TO  109 
1 0 £ 7 D F < M > ~ 0 , 0 

REF<N  J = 0 , 

CUPi N JuBTOP( N^ 

EON  i,  N > ::BTDPC  N+i  ^ 

60  70  109 
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103500 

PID  s(TflUCIRAPICIRP<LflM>  + P I Z < UflM , N ) ) / < X+1 . E-' 

103600 

if<n,gb,4>  then 

103700 

TM  « TL<N-3>/273, 

103600 

ELSE 

103900 

TN  =:  TSTR<N>/273, 

104000 

ENDIF 

iomoo 

IF  (TN, GE, 0 ,85348  ,ftND,  HCC , GT , 0 ) P I QsO , 

iOHEOO 

xf(pid,gt,i,e-4>  then 

104300 

flERl  =1,  - PID 

104400 

RER2  SI,-  (P IDRCB<LflM, N; 

104500 

RERR  s SGRT(RER1/R£R2) 

104600 

RERU  s (1,  - RERR)/2, 

104700 

RERV  s <1.  + flERfl)/2. 

104800 

RERC  s SGRT( 3, ARERl:fcflER2) 

104900 

.<1  s -(RERCAX^ 

105000 

EXl  s 0,0 

105100 

IF  <Xi  ,GE,  -180,218)  EXi  = EXF(Xl) 

105200 

IF  <EX1,LT,1,UE-30 ) EXisO.O 

105300 

EX2  s EXlAEXl 

105400 

DENO  s 1, /( (RERVRRERV ) - VRERUARERUREX2) ) 

105500 

DNMO  s (<BTDP(N)  - BTDP(NtI) /< XARERC) )A 

105600 

1 

(<RERV  - RERUREX2)  - (RERRREXl)) 

105700 

DNMl  s RERM  t RERUREX2 

105800 

EUP<N)  s (8TDP<N ) ADNMl  - DNMO  - BTDP < N+1 ) AEXl 

105900 

1 

oenOmcrerr 

106000 

EDN(N)  s (BTDP(N  + 1 >)«(DNM1  + DNMO  - BTDP  < N ) REXi 

106100 

1 

DENOkRERR 

106200 

REF<N)  s RERUxflERVXt(l,-EX2>AOEH0 

106300 

TDF(N)  s (RERV-RERU)ADENOREXI 
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£94100  C«3«£:*£3*£EMISSIDN  CflUCULftTIDNS  FUR  HAZE  LAVER, EXACT  IN  THE  SEUP<N^SE  OF  ISO 

£94£00  C3fk)»t3*C3«cIC  SCATTERING 

£94300  Cw^xytEXACT  SOLU? I ON»TND-STRE AM  SDLUT I ONxFORGE  FACTDR < P I 0 , T AuD ) 

£944U0  x60  AERA  = SQRT  < ( i , -F  1 0 > / 1 1 . -P  I Dj»cCB  < LAM  , N > > ) 

£94500  AERU  s < i- HE RA  ) / £ . U 

£94600  AERV  a <i+AERA)/£,0 

£94700  AERC  = SQRT < AER£r3 . 0 w AERl > 

£94800  EXl  :i  EXPC -AERCRTAUNCN,  K J ) 

£96600  IF  <EX1  »LT,  x,E-30>  EXisO.O 

£967.00  EX£  = EX1AEX1 

£96800  C^r;»cA;*c*FDRSE  FACTOR  FDR  ISOTROPIC  SCATTERING 
£96900  ftwo:5X,eu 

£97000  CAAa«  PIO£=PIOapiO 

£97100  CAAA  FTWDsl,O+0,i4AEXTRU-rU,iKPIU£A(l,U“EXTAUJ+<“l,03+U,4019«Pl0+0,663lA 
E97£00  CAAAXPlO£>AXAEXTAUi‘C£,  0x7£‘*0 . 680  4)*tP  I 0 -X  . 3597AP  1 0 £ ) AX  AXAEXTAUREXTAU 
£97300  denO  = (aer^xaE  *-  aeruaxc)*EX£ 

£97400  DNMU  :=  ^ B T OP  C N J - b TOP  f N + X > ) /T  AU  H ( N , K ^ > / AERC  A 

£97500  X ( AERV-AERUxEXe-AERAAEXl) 

£97600  DNMX  Si  AERM  -r  AERUAEXE 

£9870  0 EUP<N;skBTDP(N ; xO NMX- DNMD-BTOP ^ N-r  1 >AEXX)/OENOAFTWDAAERA 

£9940  0 EDN(N  JiSJveTDP^  HtX  ; ADNMX  + DNMO-BTDP  < N ^ AEXl  ^ /DENOAFTWOAAERA 

300100  CAAAA*REF<N> , TDF^N / BASED  ON  TWO  STREAM  SOLUTION 

300  £0  0 REF<N  )sAERUaAERUA^.1. 0-EX£>/OENO 

300800  TOF( N)s^AER V- AERU > /DENO AEXX 

301E00  CAAAA 

301300  CAAXA  FORM  TOP  COMPOSITE  LAYER  (AOOITIDH) 

301400  CAAAA 

301500  109  DEHOs:!.  U“RDNCN«REF(N^ 

301900  EUPCNsEUPCNf<,EUP<H  J + E O H CH  A REF  ^ N ) ) A T 0 F C < N ) /D  ENO 

302000  eoncns:eon<n^  + ^,edncn  + eup<h;ardncn)atdf<nj/deno 

30E600  I FtHCLDUDvN / , GT, 0 > CLOFLGs , TRUE , 

30£900  CAAax  set  aerosol  flag  if  cirrus  clouds  <,HXGH  ALBEDO) 

303000  IF<CLOFLG.  AND,  PIO  , GE.  X.  E-4>  AE  R F L 6s:  , T R U E , 

303400  CAXXA  TRANSMISSION  CDIIPUTED  DIFFERENTLY  FDR  3 CASES 

303500  IF  (CLDFLG. OR, AERFLG)  GO  TO  X£5 

303300  CAAAA  CASE  X,  ATMOSPHERE  HAS  NO  AEROSOLS  OR  CLOUDS  THRU  HERE 
303900  CAXAA  USE  EXPONENTIAL  INTEGRAL  APPRDXIMAIDN 
304000  TAU=TAUtTAUN<N, k ^ 

304100  CAKAA  PROTECT  AGAINST  TABLE  DUERFLON 
304200  :F(TAU,  GT.  15.  .>  GO  TO  i£4 

304300  7Y=:£0.xTau 

304400  ITYSTY+X, 

304440  XFUTY.LT.x^  IT.=X 

304500  C T0FCNsTE3( ITY ; + ( TY-ITY  + i ) A(TE3C ITY  + i ?-TE3(  ITY J ) 

305300  GO  TO  i£5 

305400  x£4  TOFCNcO . 

305500  x£5  IFC . NOT. AERFLG)  60  TO  X30 
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106305 

106310 

106315 

1063E0 

106325 

1C6400 

106500 

106600 

106700 

106800 

107400 

107450 

107500 

107600 

107700 

107800 

107900 

108000 

108050 

108100 

108200 

108300 

108400 

108500 

108600 

108700 

108800 

108900 

109000 

109100 

109200 

109300 

109400 

109500 

109600 

109700 

109800 

109900 

110000 


ELSE  If  (NCC.GT.O)  THEN 
TDF<N>  K 0,0 
REF(N>  » 0,0 
EUP<N)  » BTDP<N) 

EON<N;  - BTaP(N+i) 

ELSE  IF  < X,LT,l.E-4)  THEN 
TDF<N;s1, 0 
REF(N)  « 0,0 
EUP(N)  3 0.0 
EDN<N)  3 U,0 
ELSE 

IF  (X  ,LE,  15, 0>  THEN 
EXTfiU  s EXP(-X> 

ITY  3 Xa£0,  f 1. 

TOF(N)  3 TE3(ITY;  + (TY~ITY  + iJ  < TE3  < I T Y + 1 > - TE3  ( I T Y ^ ^ 

ELSE 

EXTflU  3 0.0 
TOF<N>  3 O.U 
ENOIF 

REF<N>  3 0,0 
XI  3 1,U  - TDF(N^ 

x2  3 (<i,0  - extru)/x-tdf(n;  > « ^(eTOP<N;^  - btop<n+1))a 

i 0,6666) 

EON<N)  3 BTOP<Ni*i)AXl  + X2 
EUP(N>  3 BTDP^N.J)«tXi-x2 
EHOIF 

DENO  3 i,0/<i.U  - RDNCH3«cREF<N;  ) 

EDNCN  3 (E0NCN  + EUP<N>3*RDNCN)  A TOF(N;  A DENO  + EON<N; 

IF  <NCC,GT,0>  CLDFLG  3 .TRUE, 

IF<CLOFL6. AND. PIO. 6E. 1, OE-4)  RERFLSs , TRUE , 

IF  C , NOT, (CLDFL6. DR. AERFLG) > THEN 
TAU  3 tAU  + X 
IF  »;7AU  ,GT,  15)  THEN 
TDFCN  3 U, 

ELSE  IF  ( (20 , aTAU+1. ) , LT, 1)  THEN 

XTY31 

TOFCN  3 TE3(  ITYj>  + (TY-ITY  + i^A(TE3(ITY  + l)-TE3(  ITY)  ) 
EHDIF 
ENDXF 
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305600 

CASE  2,  SIGNIFICANT  ABSORPTION 

305700 

RDNCHcREF<N)  + TDr<N ) ARDNCNATDF( N >/DEND 

306200 

TOFCN«TDFCNATDF< N;/DEND 

306500 

X30 

IF  <NCUDUD<N j , EG, 0 . OR. PID.GE, X, E-4 ? GO 

307000 

CASE  3.  HERUV  CLOUD  COVER 

307X00 

TDFCH»U , 

307200 

RDNCNtsU  , 

307300 

TAU»0 , 

307^00 

X40 

CONTINUE 

307500 

SAVE  PARTIAL  SUMS 

307600 

EUPC(H;=EUPCH 

307700 

EDNC<N>seoHCH 

307800 

TDFC<N;sTOFCN 

307900 

RDHCOO“RDNCr< 

303000 

xox 

CONTINUE 

308050 

c 

308060 

C fiDDIHG  GROUND  LHVER  NOT  INCLUDED 

308070 

C 

3X0600 

CA  AlK 

3X0700 

CAAAAAEDRM  DDTTDM  COMPOSITE  LfiVER  (flODITIDN; 

3X0800 

CAAA AA 

3X0900 

DO  XX8  Ns2,NG 

31X000 

1 i s N G + 1 - N 

3XXX00 

DENO  = X,U  - RUPCNAREF<MI 

3X1400 

EUPCNtiEUP<M  ^ + \EUPCNtEDN<M;ARUPCN;ATDF<I 

3X2000 

IF<H.EG,X>  GO  TO  XX9 

3X2X00 

LcN-X 

3X2200 

RUPCHsRErfM^TTDF<M^ATDF(M)ARUPCN/DEND 

5X2700 

DENDkX,0-RDNC(L;ARUPCN 

3X3X00 

PEFUP  SI EUPCNtEDNC^L  ; ARUPCN>/DEND 

3X3500 

PEFDN  si ednc<  l ;+eupchardnc(l;)/deno 

3X3900 

60  TO  X20 

3X4000 

-.19 

PEFUPsEUPCN 

3X4X00 

PEFDNsU . 

3X4200 

O A A A A 

3X4300 

FE<M^=FE(M;+CKLAMA<PEFUP-PEFDH^ 

3X4700 

1X3 

CONTINUE 

3X4800 

xOO 

CONTINUE 

3x4900 

200 

CONTINUE 

3X49X0 

C 

3X4920 

C S 

AVE  STRATOSPHERIC  FLUXES  NOT  INCLUDED 

3X4930 

c 
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REPRODUCiBiLrry  op  the 
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JliOiOO 

IF  (BERFUQ)  THEN 

iiOSOO 

ftDNCN  « REF<N/  f 

TDF(N;ATDF ( N ^ ARONCNAOENO 

110300 

TDFCN  « TDFCr4ntTDr(  N )J^DENU 

il0^00 

END  IF 

110500 

IF<NCC.NE,U  ,UR. 

p:o,u7,i,oe-4;  then 

iloeoo 

VOFCN  s 0,0 

110700 

RDNCH  !=  0,0 

110000 

TBU  K 0,0 

110900 

END  IF 

111000 

EUPCvN;  u EUPCN 

XlllOO 

E0NC<N;  :j  EDNCH 

iliSOO 

TDFCCHj  » TDFCN 

111300 

RDNCCH;  Js  RDNCN 

111350 

iOl 

CONTINUE 

111400 

DO  118  H a NQ-1,1 

111500 

DEND  = l,U/<i.0- 

RUPCN)«£REF<M  > > 

lll&OO 

EUPCN  = EUP<M;  ♦ 

C (EON<rU*RUPCN+EUPCN > x TOF(M^xDENU 

111700 

IF  (M,NE,i^  THEN 

111800 

RUPCN  = REFCMJ 

+ (TDF(M;ATDF<M^XRUPCN)«cOENU  ) 

111900 

U ii  H-1 

ilEOOO 

DENU  s 1, /<i.-ROHCCL;ARUPCN^ 

IIEIOO 

PEFUP  c (.EUPCN 

r EDNC<LMRUPCN)  ADENO 

IIEEOO 

PEFON  s ^EUPCN 

+ EDNC<L > ARONC<U 4 >«DENU 

115300 

ELSE 

115400 

PEFUP  n EUPCN 

115500 

PEFDN  s 0,0 

115600 

ENDIF 

115700 

fe^m;  s FE<M,J  t 

(.PEFUP-PEFDN  ^ wCLKBM 

115800 

ilS 

CONTINUE 

115900 

100 

CONTINUE 

113000 

500 

COr^TINUE 
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APPENDIX  H 

CONNECTION  NETWORK  SIMULATION  TOOLS 

H.1  SUMMARY

Two computer-based tools were developed to aid the study of the
various Connection Networks. The first was a functional simulator,
which supported evaluation of the Benes, the single-layer Omega,
and the simple double-layer Omega networks described in Appendix B.
The networks could be exercised in a number of modes, including
random inputs and p-ordered inputs. Section H.2 below discusses
these capabilities in more detail.

The second tool was a stochastic analyzer (see Section H.3). This
tool used the probability of addresses occurring and the probability
of requests occurring within the network to predict blockage
within the network. Although this approach precluded actually
observing where specific blockages would occur for a particular
input situation, the tool was felt to be necessary because it
would be unreasonable to run all possible input combinations. The
stochastic analyzer was used to evaluate both the single-layer
Omega network and the double-layer network which included
inter-layer connectivities.
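
This section does not give the analyzer's equations. As a hedged illustration of how blockage can be predicted from request probabilities alone, the sketch below applies a common independence approximation for networks built from 2 x 2 switches: if each switch input carries a request with probability p, and each request is equally likely to want either output, a given output is busy with probability 1 - (1 - p/2)^2. The function names, the uniform-address assumption, and the use of Python are assumptions made for illustration; they are not taken from the report's tool.

```python
def stage_output_probability(p_in):
    """Probability that one output of a 2x2 switch carries a request.

    Assumes (for illustration only) that each of the two inputs
    independently carries a request with probability p_in and that a
    request wants either output with probability 1/2.
    """
    return 1.0 - (1.0 - p_in / 2.0) ** 2


def accepted_fraction(p_request, levels):
    """Estimated fraction of offered requests that pass all levels.

    The network has as many outputs as inputs, so the fraction that
    gets through is the output load divided by the input load.
    """
    if p_request == 0.0:
        return 1.0  # nothing offered, nothing blocked
    p = p_request
    for _ in range(levels):
        p = stage_output_probability(p)
    return p / p_request


if __name__ == "__main__":
    for load in (0.25, 0.5, 1.0):
        print(load, accepted_fraction(load, 10))
```

Under these assumptions, `accepted_fraction(1.0, 10)` estimates the share of requests that would cross a fully loaded 10-level network. Non-uniform address probabilities, which the text says the analyzer took into account, are not modeled in this sketch.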

It  was  noted  in  Appendix  B that  both  tools  gave  comparable  results 
when  run  on  the  same  cases.  This  correspondence  gave  confidence 
in  the  results  obtained  using  these  tools. 

H.2  CONNECTION  NETWORK  FUNCTIONAL  SIMULATOR 

H.2.1  Model 

The CN simulator is designed to simulate a CN in which requests
propagate through at the speed of transmission delay in cable and
combinatorial logic, after which the path is locked up for the
duration of the EM cycle. After an EM cycle, the nodes involved
in this path may be unlocked if they are not involved in another
EM request.

In addition to nets with the connectivity of Benes networks and
Omega networks (see Appendix B), there are options on the number
of redundant paths supplied. There can be twice as many ports on
the processor side as there are processors, or there can be just
512. The EM module ports can be spread across the entire 1024
ports on that side, or they can occupy the first 521. The
simulator basically has a 1024-wide network of 2 x 2 switches.
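
To make the blocking behavior concrete, the following is a minimal sketch of one pass of requests through an Omega network of 2 x 2 switches. It is not the report's simulator. It assumes the usual perfect-shuffle wiring between levels, destination-tag routing (one destination bit examined per level, most significant bit first), priority to the lower-numbered ("upper") input on conflict, and that a losing request is simply dropped for this pass. The names `shuffle` and `route_once` are hypothetical.

```python
def shuffle(pos, n_bits):
    """Perfect shuffle: rotate the n_bits-wide line number left by one bit."""
    mask = (1 << n_bits) - 1
    return ((pos << 1) & mask) | (pos >> (n_bits - 1))


def route_once(dests, n_bits):
    """Route one set of requests through an n_bits-level Omega network.

    dests[i] is the output port wanted by input line i, or None for no
    request. Returns {input_line: output_port} for requests that were
    not blocked.
    """
    # Map current line position -> (source line, destination port).
    live = {src: (src, d) for src, d in enumerate(dests) if d is not None}
    for level in range(n_bits):
        nxt = {}
        # Visiting lines in ascending order gives the upper input priority.
        for pos, (src, d) in sorted(
            (shuffle(p, n_bits), sd) for p, sd in live.items()
        ):
            bit = (d >> (n_bits - 1 - level)) & 1  # destination-tag bit
            out = (pos & ~1) | bit                 # chosen switch output
            if out not in nxt:                     # otherwise blocked
                nxt[out] = (src, d)
        live = nxt
    return {src: d for src, d in live.values()}


if __name__ == "__main__":
    import random
    rng = random.Random(1234)
    # 512 processors on the first 512 of 1024 lines, 521 EM module ports.
    requests = [rng.randrange(521) for _ in range(512)] + [None] * 512
    print(len(route_once(requests, 10)), "of 512 requests passed")
```

In the simulator described here, a path that gets through would then stay locked for the EM cycle; this sketch stops at deciding which requests get a path.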

The  number  of  CN-clock  cycles  per  EM  cycle  time  can  be  adjusted 
from  1 to  9 by  an  input  parameter. 
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Each  simulated  processor  has  a queue  of  up  to  six  memory  requests. 
The  Nth  entry  in  this  queue  may  be  either  a set  of  "S"  random  EM 
module  numbers,  with  512-S  of  the  processors  having  null  requests, 
or  the  entry  may  be  a p-ordered  or  a p-q-ordered  vector  of  EM 
module  numbers,  with  512-S  of  the  processors  having  their  requests 
nulled before the program starts. S is an input parameter ranging
from 0 to 300, or equal to 512.

The four-digit seed of the random number generator is included in
the set of input parameters.
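
As a sketch of how one queue entry might be built, the code below fills one position across all 512 processor queues, either with random EM module numbers or with a p-ordered vector, and then nulls all but S requests. The module count of 521, the reading of a p-ordered vector as (offset + i x skip) modulo the module count, and all function names are assumptions for illustration; the report does not show the simulator's own code.

```python
import random

N_PROCESSORS = 512
N_EM_MODULES = 521  # assumption: one module per EM port mentioned above


def null_all_but(entry, s, rng):
    """Keep s randomly chosen requests; replace the others with None."""
    keep = set(rng.sample(range(len(entry)), s))
    return [m if i in keep else None for i, m in enumerate(entry)]


def random_entry(s, rng):
    """One queue position: s random EM module numbers, the rest null."""
    entry = [rng.randrange(N_EM_MODULES) for _ in range(N_PROCESSORS)]
    return null_all_but(entry, s, rng)


def p_ordered_entry(offset, skip, s, rng):
    """One queue position holding a p-ordered vector (assumed form)."""
    entry = [(offset + i * skip) % N_EM_MODULES for i in range(N_PROCESSORS)]
    return null_all_but(entry, s, rng)


if __name__ == "__main__":
    rng = random.Random(1234)  # four-digit seed, as in the input parameters
    queue = [random_entry(300, rng), p_ordered_entry(3, 5, 512, rng)]
    print(sum(m is not None for m in queue[0]), "live requests in entry 0")
```

Seeding a single generator from the four-digit input keeps a run repeatable, which is presumably why the seed is part of the input parameters.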

H.2.2  Simulator Controls

The input commands accepted by the functional simulator are listed
in Table H.1 below. Some of these inputs are optional and have
default values as indicated.


Table H.1

CN Functional Simulator Input Commands

Command   Description

Fn        Type of network, where n is the sum of:

          0: if a 19-level Benes network
          1: if a 10-level single-layer Omega network
          2: if a 10-level double-layer Omega network with
             alternating priorities
          4: if processor M is attached to input port 2M
          8: if EM module N is attached to output port 2N up to
             511, with the other 9 attached to output ports
             1, 3, 5, ... 17

          (If no F command, F0 is used as default.)


An        Algorithm to be used within each 2x2 node, where n is:

          0: the node gives priority, in case of conflict, to the
             lower-numbered ("upper") input in all one-sheet
             (single-layer) cases, and gives priority to the
             higher-numbered ("lower") input for the second sheet
             in a double-layer Omega network.
          1: the node sets a straight-through connection in the
             case of conflict.
          2: the node alternates the priorities between upper and
             lower input ports on alternate CN clocks. If this
             mode is chosen, it is recommended that the number of
             CN clocks/EM clock be odd (see Tn command below).
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Table  H*1  (Continued) 

(If  no  A command,  AO  is  used  as  default*) 

Snnn      Causes all but "nnn" of the 512 entries in each queue
          position in the processor to be erased. The choice of
          which entries to erase is random.

          (If no S command, S512 is used as default. This
          corresponds to no erasures.)

Tn        Sets "n" cycles of the CN clock for each EM access time.

          (If no T command, T0 is used as default.)

BR        An optional command that signals the "bit-reversal" of
          processor number to CN port number. That is, if BR, then
          proc. 1 goes to port 256, proc. 2 goes to port 128, and
          proc. 3 goes to port 384. That is, proc. 00000011 goes to
          port 11000000. Processor no. 00010111 goes to port no.
          11101000, etc.

nnnn      A four-digit number sets the seed for the random number
          generator.


Pnnnmmm   Sets a p-ordered vector into the next entry across all
          processor queues. The entry has an offset of "nnn" and a
          skip distance of "mmm".

Qaaassskkkxxxqqq
          Sets a p-q-ordered vector into the next entry across all
          processor queues. "aaa" is the offset to the start of the
          first vector piece, "sss" is the skip distance within
          pieces, "kkk" is the length of each piece, "xxx" is the
          number of elements omitted if the first piece is shorter
          than kkk, and "qqq" is the skip between the end of one
          piece and the beginning of the next.

R         Sets a vector of 512 random requests into the next entry
          in all the processor queues. (The seed for the random
          number generator should precede this command.)

Lnnn      Imposes a limit on the number of CN cycles through which
          the simulation will run. Termination will be after
          "nnn+1" CN cycles.

          (If no L command, L047 is used as default.)

Warning: Although the input is free-form, in that the sequence of
commands does not matter and any number of intervening blanks is
allowed, each number must follow its command without any
intervening blank, and must consist of exactly the correct number
of digits.

Following all commands, any character (such as "X") that is not ?,
E, N, D, or the first character of any valid command will terminate
the input. The rest of the card can then be used for a comment,
which will be printed out on the first line of output.
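
The BR mapping listed in Table H.1 can be checked with a short
sketch. It is an illustration only, not the simulator's code. The
function name and the 9-bit default width are assumptions; the width
is inferred from the decimal examples (proc. 1 to port 256, proc. 2
to port 128, proc. 3 to port 384), while the binary examples in the
table are printed with eight digits.

```python
def bit_reverse(value: int, width: int = 9) -> int:
    """Reverse the low `width` bits of `value` (hypothetical helper)."""
    result = 0
    for _ in range(width):
        # Shift the lowest remaining bit of value into the result.
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


if __name__ == "__main__":
    # Decimal examples from the BR entry, using the assumed 9-bit width.
    for proc in (1, 2, 3):
        print(f"proc. {proc} -> port {bit_reverse(proc)}")
    # The eight-digit binary example from the same entry.
    print(format(bit_reverse(0b00010111, width=8), "08b"))
```

With a width of 9 the sketch reproduces the three decimal examples,
and with a width of 8 it reproduces the printed binary examples.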

H.2.3  Simulator  Output 

Figures H.1, H.2, and H.3 are examples of three typical CN
Functional Simulator outputs. These examples happen to use
p-q-ordered vectors as inputs, with piece lengths of 31, 100, and
30. The cases were taken from the explicit and implicit aero flow
code. Two of these cases are in mesh sizes as exhibited in the
listings supplied by NASA, and one of the cases exhibits the full
size.

The first line of the printout, which begins with "TEND", prints
the input commands as previously described. For example, Figure
H.1 shows (on the first line):

T2        (2 CN clocks per EM access time)

BR        (Bit reversal of processor number to CN port number)

F14       (Double-layer Omega network with alternating priorities
          (2) + processor M attached to port 2M (4) + EM module N
          attached to port 2N (8))

Q047 1 31 1409    (p-q-ordered vector with

            047  offset
              1  skip distance within pieces
             31  length of each piece
              1  number of elements omitted if the first piece is
                 shorter than 31
            409  skip between the end of one piece and the
                 beginning of the next)
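
The decoding of the F and Q commands above can be sketched as
follows. This is an illustration under stated assumptions, not the
simulator's input routine: the function names and dictionary keys
are hypothetical, and the Q string is assumed to be written with
five zero-padded three-digit fields as in the Qaaassskkkxxxqqq
format of Table H.1 (so the example above would be entered as
Q047001031001409).

```python
def decode_f(n: int) -> list[str]:
    """List the network options whose values sum to the F code n."""
    if not 0 <= n <= 15:
        raise ValueError("F code must be between 0 and 15")
    parts = []
    if n & 1:
        parts.append("10-level single-layer Omega network")
    if n & 2:
        parts.append("10-level double-layer Omega network "
                     "with alternating priorities")
    if not n & 3:
        # Neither Omega option selected: the base value 0 is the Benes network.
        parts.append("19-level Benes network")
    if n & 4:
        parts.append("processor M attached to input port 2M")
    if n & 8:
        parts.append("EM module N attached to output port 2N")
    return parts


# Hypothetical field names for aaa, sss, kkk, xxx, qqq.
Q_FIELDS = ("offset", "skip", "length", "omitted", "gap")


def parse_q(command: str) -> dict[str, int]:
    """Split a Q command into its five three-digit fields."""
    digits = command[1:]
    if command[:1] != "Q" or len(digits) != 15 or not digits.isdigit():
        raise ValueError("expected Q followed by exactly 15 digits")
    return {name: int(digits[3 * i:3 * i + 3])
            for i, name in enumerate(Q_FIELDS)}


if __name__ == "__main__":
    print(decode_f(14))
    print(parse_q("Q047001031001409"))
```

The fixed field width mirrors the warning in Section H.2.2 that each
number must consist of exactly the correct number of digits.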

The  next  several  lines  of  the  output  summarize  the  simulation 
conditions  specified  by  the  input  commands. 

The  remaining  output,  summarized  below,  is  printed  at  each  network 
layer  at  each  CN  clock. 

1st line: Number of items left in processor queue before
   beginning. Does not include items picked up by bumping the
   processor queue pointers.

2d and 3d line: Report of distribution of EM conflicts (pileups):
   (number of pileups) x (length of pileup), for lengths of 0
   through 15. Any EM module with a pileup of 10 or more will
   have a line stating its module number and the size of the
   pileup.

4th line: "On the nth cycle"

5th line: "There were sss successes in rrr requests". For
   version A, the number of successes listed in the first report is
   for the first layer; the number of successes listed in the
   second report for the nth cycle is the total for both layers.

Next 32 lines: 512 entries, one for each processor. Each entry is
   "-" if no request was made; otherwise it is the EM module number
   of the request, prefixed by "*" if the request was granted, or
   by "EM" if the EM cycle is still running, so that the path is
   locked up.

Figure H.1 CN Simulator Output, First Example

Figure H.2 CN Simulator Output, Second Example

Figure H.3 CN Simulator Output, Third Example

H.3  CONNECTION  NETWORK  STOCHASTIC  ANALYZER 

The  Connection  Network  Stochastic  Analyzer  is  used  to  compute  the 
probabilities  of  input,  output  and  blockage  for  each  switch  across 
the  connection  network  (CN).  These  computations  are  then  used  to 
determine  blockage  at  each  level,  and  finally  to  determine  total 
blockage.  This  tool  was  not  developed  to  test  the  performance  of 
the  CN  under  specific  conditions.  Rather,  the  question  raised  was 
what  would  the  effect  of  the  CN  be  on  the  average.  An  initial 
assumption  was  made  that  the  inputs  to  be  evaluated  would  be 
random  permutations  of  the  destination  addresses.  Under  this 
assumption,  no  blockage  would  occur  due  to  simultaneous  reference 
to the same destination. Although such simultaneous references will
actually occur, including them would be misleading when studying
the effect of the network itself. The functional simulator did
allow consideration of such simultaneous reference situations.

H.3.1  Model 

The Stochastic Analyzer was implemented to study the single-layer
Omega network and the double-layer Omega network with interlayer
paths at each node. An example of such a network is shown in
Figure H.4. In this figure 8 processors and 11 memories are
connected. For the purposes of this model, the 11 extended memory
modules are "spread" as evenly as possible across the output ports
of the net (i.e., with 16/11 steps between each connection). This
mapping should be equivalent (although it is not the same) to some
of the mappings discussed in Appendix B.

H.3.1.1 Input Probabilities

Since only random permutations are considered, each destination
port address (for those ports with memory modules attached) occurs
with equal probability. As pointed out earlier, there may not be
as many memory modules connected as there are output ports on the
network. As a result, the probability that a specific bit in the
destination address = 0 or 1 is likely to vary from bit to bit.
For example, in Figure H.4, the probability that the high-order
bit of the destination address = 1 is 5/11 and the probability
that that bit = 0 is 6/11. This probability affects the
probability of an address occurring on one or the other outputs of
a switch.
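
The bit probabilities quoted above can be reproduced with a small
sketch. The placement rule used here, module i on port
floor(16i/11), is an assumption: the text only says that the modules
are spread with 16/11 steps between connections, and Figure H.4 is
not reproduced here. The rule was chosen because it yields the 5/11
and 6/11 values quoted for the high-order bit and the values used
for the next bit in Section H.3.1.2. Function names are
hypothetical.

```python
from fractions import Fraction


def spread_ports(modules: int = 11, ports: int = 16) -> list[int]:
    """Assumed placement: module i sits on output port floor(i*ports/modules)."""
    return [(i * ports) // modules for i in range(modules)]


def bit_one_probability(port_list: list[int], bit_from_top: int,
                        address_bits: int = 4) -> Fraction:
    """Probability that the given address bit is 1 over equally likely ports.

    bit_from_top = 0 selects the high-order bit of the port address.
    """
    shift = address_bits - 1 - bit_from_top
    ones = sum((port >> shift) & 1 for port in port_list)
    return Fraction(ones, len(port_list))


if __name__ == "__main__":
    ports = spread_ports()
    print(ports)
    for bit in range(4):
        print(f"P({bit}-BIT=1) =", bit_one_probability(ports, bit))
```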

H.3.1.2  Probability  Computations 

The  computations  performed  by  the  analyzer  are  based  on  the 
probability  of  occurrence  of  each  possible  input  combination  to 
each  switch  or  node  in  the  network.  For  example,  consider  the 
switch  marked  A in  Figure  H.4.  For  this  example,  assume  that  the 
network  is  a single-layer  Omega  network.  The  probability  of 
blockage  in  that  switch  is  the  probability  that  the  inputs  are 
either both 0 or both 1 simultaneously. If P(INPUT) is the
probability of an input request occurring and if P(0-BIT),
P(1-BIT), P(2-BIT), P(3-BIT) are the probabilities of the
high-order through low-order bits of the destination address, then
the upper input to A exists with the probability:

     P(A-UPPER) = P(INPUT) x P(0-BIT=1)                      (H.1)

Similarly, for the lower input:

     P(A-LOWER) = P(INPUT) x P(0-BIT=1)                      (H.2)

The probability of blockage in switch A can now be determined:

     P(A-UPPER=1) = P(INPUT) x P(0-BIT=1) x P(1-BIT=1)       (H.3)

     P(A-UPPER=0) = P(INPUT) x P(0-BIT=1) x P(1-BIT=0)       (H.4)

     P(A-LOWER=1) = P(INPUT) x P(0-BIT=1) x P(1-BIT=1)       (H.5)

     P(A-LOWER=0) = P(INPUT) x P(0-BIT=1) x P(1-BIT=0)       (H.6)

     P(BLOCK-IN-A) = P(A-UPPER=1) x P(A-LOWER=1) +
                     P(A-UPPER=0) x P(A-LOWER=0)             (H.7)

Substituting known values:

     P(INPUT) = 1 (assume all inputs active)

     P(0-BIT=1) = 5/11
     P(1-BIT=0) = 6/11
     P(1-BIT=1) = 5/11

Then:

     P(BLOCK-IN-A) = (5/11 x 5/11) x (5/11 x 5/11) +
                     (5/11 x 6/11) x (5/11 x 6/11) = .104

Using similar techniques, the probability of outputs occurring on
the outputs of switch A can be determined. This sort of
computation can then be carried on through the network, taking
into account the probability of blockages and the probability of
the corresponding address control bit.
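
The arithmetic of equations (H.3) through (H.7) for switch A can be
checked with exact fractions. The sketch below is an illustration
only; the function name and argument names are hypothetical, and it
covers just this one switch, not the analyzer's propagation through
the whole network.

```python
from fractions import Fraction


def block_probability(p_input: Fraction, p_bit0_1: Fraction,
                      p_bit1_1: Fraction) -> Fraction:
    """Blockage probability for switch A following (H.3)-(H.7)."""
    p_bit1_0 = 1 - p_bit1_1
    # Both inputs of A are reached only when the high-order bit is 1.
    upper_1 = p_input * p_bit0_1 * p_bit1_1   # (H.3)
    upper_0 = p_input * p_bit0_1 * p_bit1_0   # (H.4)
    lower_1 = p_input * p_bit0_1 * p_bit1_1   # (H.5)
    lower_0 = p_input * p_bit0_1 * p_bit1_0   # (H.6)
    # A blocks when both inputs request the same output.
    return upper_1 * lower_1 + upper_0 * lower_0   # (H.7)


if __name__ == "__main__":
    p = block_probability(Fraction(1), Fraction(5, 11), Fraction(5, 11))
    print(p, round(float(p), 3))
```

The exact value is 1525/14641, which rounds to the .104 given above.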
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H.3.2  Analyzer  Controls 

The  user  inputs  the  number  of  processors  and  memory  modules  in  the 
system  as  well  as  the  number  of  switch  levels  (up  to  10),  the 
number  of  input  connection  points,  the  number  of  active  processors 
( total  processors),  and  the  number  of  layers  (1  or  2)  in  the 
network.  Using  this  information  the  analyzer  builds  a table 
representing  the  connection  network.  This  table  provides  for 
processors  to  be  mapped  onto  input  ports,  outputs  from  one  level 
mapped  onto  inputs  of  the  next  level,  and  switch  outputs  mapped 
onto  memory  modules.  Each  switch's  input  probability  is  used  to 
compute  its  own  output  and  blockage  probabilities. 

H.3.3  Analyzer  Output 

When the calculations for each switch are completed, a listing is
prepared which fully describes the network analyzed. All of the
user-input information is printed, as well as processor and memory
module mappings. Total blockage and blockage for each level are
printed, as well as each switch's output probabilities.

Figure H.5 shows an example of an additional output which
summarizes the results of a number of runs. The output for each run
specifies the number of processors, of memory modules and of ports
in the network being evaluated. The number of active processors
identifies the average number of processors actively presenting
requests to the network. Cumulative blockage probability is the
probability that any request made is blocked somewhere within the
network. The number of inputs per switch identifies which type of
network was run. A 2-input switch is used on the single-layer
Omega network. A 4-input switch is used on the double-layer Omega
network with interlayer communication. The line identified as
Probability of Blockage summarizes the cumulative blockage at each
level through the network from the processors (on the left) to the
memory modules (on the right).

Figure H.5 Stochastic Analyzer Sample Output
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APPENDIX  I 

BENES AND OMEGA NETWORKS FOR FLOW MODEL PROCESSING*

I.1 INTRODUCTION

Parallel processing machines gain time at the expense of
additional processing elements. However, parallelism entails
processor access problems. The major assumptions of the NASF Flow
Model Processor are:

1) There are 512 processing elements and 521 extended memory
modules.

2) Some hybrid of a Benes or Omega network is used to connect
processing elements to EM modules and processing elements to
processing elements (see Figures I.1A and I.1B).

Roughly, the more processing elements, the faster the machine can
run, given a program which exhibits a large degree of parallelism.
If there is a prime number of memory modules (521 is prime), then
corresponding column elements of a p-ordered vector are stored in
different extended memory modules, making it particularly easy to
access a column at a time (see Figure I.2). However, in assuming
521 EM modules, we presume that matrices are to be stored across
the EM's. It may be beneficial to be a slight bit heretical and
ask whether matrices stacked into a single EM might not be more
effective in executing block transfers to local memory. It is to
be remembered that a single processor will most often want, say,
VECT(I), VECT(I+1), and VECT(I-1), which may be stored
concurrently in local memory. This, however, seems to be mainly a
software problem.

The choice of a Benes or an Omega network is a pragmatic one based
on required hardware and expected transmission time. (See the
chart on pg. 109 of Ref. 1.) Ultimately, we settle for Benes and
Omega networks because they appear to be the most efficacious
solution presently available. While Benes [2, 3, 4] has shown
that for the network which bears his name there exists a
non-blocking control pattern for every arrangement of inputs to
outputs, practically speaking, computation time is prohibitive.
Thus, the concept of distributed control arises; this concept
works especially well with an Omega, since at the ith level in the
network there is a relatively simple mapping between the ith most
significant bits of two or more addresses and the state of the
switch.


*Originally submitted in September, 1978.

Figure I.2

Storing a p=5 Matrix
in a Prime Number of EM Modules

I.2 ANALYSIS OF TWO-LEVEL OMEGA NETWORK WITH INTER-LAYER
CONNECTIONS

The object of the two-layer Omega node-level analysis is to obtain
the blocking probability at each level in the network from its
input probabilities. The output probabilities can then be
calculated from the input probabilities and the blocking that goes
on at each switch. These outputs are then re-ordered by the
connectivity of the network, and they become the inputs for the
next level.

An important fact in this analysis is that each switch in a given
level has the same set of input probabilities. Thus, the
probability of a block at one switch becomes the probability of
blocks at N switches. We assume an unpacked Omega (with N
processors attached to 2N input ports), so that the inputs to level
one are all at the A-port of the first-layer node (Figure I.3).
There are then two possible inputs, since it is equally likely that
the address bit will be a one or a zero. These bits determine the
switching operations performed by the node on the address under
the switching rules. It is clear that on the first level of an
unpacked network there will be no blocks and that, furthermore,
the second layer is not used. The topology of the network implies
that there are nine possible input combinations to the second
level, each of which has an associated probability. On the second
level, the fact that I_A = address, I_B = address (where I_A and
I_B are inputs A and B) is now a possible combination implies that
the second layer is now used, although there are still no blocks on
this layer. There are 49 possible input combinations to the third
level. Blockage is now possible, since there can be three inputs
to one two-layer switch pair. On the fourth and all subsequent
levels, there are 81 input types. This is basically base three in
four places, where the three characters 0, 1 and blank are permuted
over I_A, I_B, I_a and I_b. Table I.4 gives the input, output and
blocking probabilities for these first four levels done in the
hand simulation.
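
The count of 81 input types, "base three in four places", can be
enumerated directly. The sketch below is only an illustration: the
symbol used for a blank input and the function names are
assumptions, and it does not attempt to reproduce the counts of 9
and 49 quoted for the second and third levels, which depend on the
network topology.

```python
from collections import Counter
from itertools import product

# Each of the four inputs carries address bit 0, address bit 1, or is blank.
SYMBOLS = ("0", "1", "-")


def input_types(ports: int = 4) -> list[tuple[str, ...]]:
    """All assignments of the three symbols to the given number of inputs."""
    return list(product(SYMBOLS, repeat=ports))


def count_by_active_inputs(types: list[tuple[str, ...]]) -> Counter:
    """Group the input types by how many inputs actually carry an address."""
    return Counter(sum(symbol != "-" for symbol in t) for t in types)


if __name__ == "__main__":
    types = input_types()
    print(len(types))
    print(sorted(count_by_active_inputs(types).items()))
```

Of the 81 types, 48 have three or more active inputs, which is the
situation the text identifies as making blockage possible.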

There are two concepts which should be understood concerning the
evaluation of the network through each succeeding stage. They are
a) increasing randomness, and b) decreasing density. While
initially most of the addresses are on the lower layer, conflicts
on the lower layer tend to send more addresses to the upper layer.
In equilibrium, both layers will be equally occupied. Now a
necessary condition for a block in the network is that there be two
inputs on one layer, and one on the other. One might think,
therefore, that maximum blocking will occur when the first layer
has twice as many addresses as the second layer, but since
blocking is symmetric between layers, maximum blocking is expected
to occur when the two layers are equally dense, i.e., when the
system is completely randomized. On the other hand, the fact that
blocking implies a decreased density of addresses as the addresses
are blocked means that the number of blocks should decrease as
requests progress through the network. However, since the actual
number of blocks is small, this effect will be small at first, but
will become more important as the system tends towards
equilibrium. Thus, we would expect the blocking probability to
increase initially due to the effect of randomization, and then to
decrease due to the effect of decreasing density. However, it is
not clear on which level the turning point will occur.


Figure I.3
A node of level 1

When it was realized that more blocks would occur on some levels
than on others, a search began for the best way of adding a small
amount of hardware to improve performance. First, it was felt
that one would probably not want to take any levels off of the
second layer (with the exception perhaps of the first level), since
the loss in efficiency at that level would be greater than any
"marginal gains" at any point in a third layer. Secondly, it is
clear that once a new layer is initiated, it must be continued to
the destination ports if the addresses are not to be injected back
into the lower layers.

This led to the concept of the three-layer network by asking how
such a system might "grow". Indeed, there are some similarities
between the transposition network and the corpus callosum (which
unites the two hemispheres) of the human brain. However, it seems
somewhat deceptive to think in terms of layers, for each switch
pair may be reduced to a planar circuit with, say, four inputs
being mapped to four outputs. As described in Chapter 5, each of
these input and output sets is composed of twelve or more wires,
at least nine of which control the switch settings for the various
levels of the network; the other three or more wires may play
special parts in the local control of the switch. The frames of
data may follow the 'net-code' through the network to be stored in
buffers at the terminal end.

I.3 SKIP DISTANCE ANALYSIS

When a p-ordered vector is stored across extended memory,
corresponding column elements are stored in modules (o+pi) mod 521
as shown in Figure I.2, where o is the offset and i is the row
number. When each processor gets a succeeding row element, i
becomes the processor number. This is particularly important in
lock-step operation, but is also relevant in the early stages of
any loop.
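
The storage rule (o+pi) mod 521 can be exercised with a short
sketch, which shows why a prime number of modules keeps the 512
processors on distinct modules for every skip that is not a
multiple of 521. The function names are hypothetical, and the
sketch looks only at module conflicts, not at blocking inside the
connection network.

```python
from collections import Counter


def modules_for_vector(offset: int, skip: int,
                       processors: int = 512, modules: int = 521) -> list[int]:
    """EM module requested by processor i for a p-ordered vector."""
    return [(offset + skip * i) % modules for i in range(processors)]


def max_pileup(offset: int, skip: int) -> int:
    """Largest number of processors addressing the same EM module."""
    return max(Counter(modules_for_vector(offset, skip)).values())


if __name__ == "__main__":
    # Every skip from 1 to 520 spreads the requests over 512 distinct modules.
    print(all(max_pileup(0, skip) == 1 for skip in range(1, 521)))
    # A skip of 521 behaves like a skip of zero: all requests hit one module.
    print(max_pileup(0, 521))
```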

Results of hand-simulations which were performed for 8 x 11
unpacked Benes and Omega networks are summarized in Tables I.1,
I.2 and I.3. It becomes clear from these charts that these
networks are symmetric with respect to skip distance, i.e., there
is a correspondence:


skip 1 to skip 10
skip 2 to skip 9
skip 3 to skip 8
skip 4 to skip 7
skip 5 to skip 6

Table I.1

Summary of Node-Level Hand Analysis

           Type   Input         Output        Blocking

Level 1    **     1/2           1/2           0%
           A*     1/2           1/2
           *A     0             0
           AA     0             0

Level 2    **     9/16          9/16          0%
           A*     6/16          6/16
           *A     0             0
           AA     1/16          1/16

Level 3    **     9604/16384    9605/16384    1.46%
           A*     5096/16384    5096/16384    (7.47 blocks)
           *A     392/16384     392/16384
           AA     1292/16384    1292/16384

Level 4    Not completed                      1.79%
                                              (8.99 blocks)


Table I.2

Skip Distance Analysis for OMEGA Network

                        Offset
Skip    1  2  3  4  5  6  7  8  9  10    Ave.

  1     1  4  1  4  1  3  2  2  3   1    2.6
  2     1  0  0  0  1  0  0  1  1   0    0.4
  3     1  2  2  2  2  2  2  2  2   1    1.8
  4     1  3  1  1  1  2  2  3  1   1    1.8
  5     0  0  0  0  0  0  0  0  0   0    0
  6     0  0  0  0  0  0  0  0  0   0    0
  7     2  2  3  1  1  2  1  3  1   1    1.8
  8     1  2  2  2  2  2  2  2  2   1    1.8
  9     1  0  0  0  1  0  0  1  1   0    0.4
 10     2  3  2  1  4  1  4  0  4   0    2.4

*Assured by Symmetry

Table I.3

Skip Distance Analysis for BENES Network

                          Offset
Skip    0  1  2  3  4  5  6  7  8  9  10    Ave.

  1     0  0  0  0  0  0  0  0  0  0   0    0
  2     0  2  1  3  2  2  0  1  0  0   1    1.1
  3     0  0  0  0  0  0  0  0  0  0   0    0
  4     1  1  1  1  1  1  2  1  2  1   2    1.3
  5     0  0  0  0  1  1  0  1  1  0   0    0.4
  6     0  0  1  1  0  0  0  0  0  0   2    0.4
  7     1  2  1  2  1  2  1  1  1  1   1    1.3
  8     0  0  0  0  0  0  0  0  0  0   0    0
  9     2  0  2  2  3  1  2  0  1  0   1    1.3
 10     0  0  0  0  0  0  0  0  0  0   0    0


I.4 TOWARD A GENERAL ANALYSIS OF TRANSPOSITION NETWORKS


In its most abstract formulation, a system such as a transposition
network can be described in terms of its states in a stochastic
process. In an unpacked one-layer Omega network, there are
$n2^{n-1}$ switches, where n is the number of levels. Each switch
can occupy one of nine possible states, two of which are blocking
and seven non-blocking. Much like a system of N pennies, which can
take on $2^N$ different states, an n-leveled transposition network
can take on $9^{n2^{n-1}}$ states, $7^{n2^{n-1}}$ of which are
non-blocking. These very large numbers give us every possible
combination of switch configurations which the network can occupy.

One problem with such an analysis [2] is that not every state
corresponds to a physically realizable configuration. In
particular, there will be states for which no continuous path can
be drawn.

One might then come to take the path, rather than the state of a
set of switches, as our unit of analysis. In an Omega network
there is one and only one path by which any given input can reach
any given output. (In a Benes, there are $2^{n-1}$ such paths, one
for each switch from the middle level.) Now assume that there are
$2^{n-1}$ inputs, one for each switch, and $2^{n-1} + r$ outputs,
where r includes additional outputs plus one output corresponding
to a null request. Then there are order 2^n states in the sample
space, each described by $2^{n-1}$ input-output pairs.

The problem then becomes one of obtaining the blocking probability
for each of these states. This must involve the structure of the
network itself. One can note, however, that blocking in an Omega
is a function of input pairs, for on any level only two inputs may
share the same switch. A mathematical algorithm for determining
whether any given input pair results in a block is given in the
following section, Part B. It is noted here that such an
algorithm requires, at most, a comparison of each of $2^{n-1}$
input-output pairs for each of n levels. Thus, for order 2^n
states, there are order n2^n or N4log2N comparisons that must be
made to completely determine the blocking probabilities for all
possible states.

This number may well be dishearteningly large for practical
results, even if it need be done only once for simulation
purposes. Says Benes [1]:

     In most congestion problems, it is easy enough to construct
     (say) a Markov process that is a probabilistic model of the
     system of interest. But it is difficult, because of the
     large number of states and complexity of the structure, to
     obtain either analytic results or fast reliable procedures.
     This circumstance has been a major obstacle to progress in the
     congestion theory of large systems. One of its consequences
     has been that in some cases, models known to be poor
     representations of systems have been used merely because they
     were mathematically amenable, and no other tractable models
     were available. (pp. 1216-1217)

In another place he talks of possible "equivalence relations"
between simple models and more complex ones.

The following is actually a model for determining the probability
that x random assignments from N inputs to N outputs will be
unique. The first input may choose any of the N outputs. The
second input has an (N-1)/N probability of choosing one of the
empty ones. The ith input has an (N-i+1)/N chance of choosing one
of the empty ones. For x random assignments, the probability is

\[ E(x \text{ in } N) = \frac{N!}{(N-x)!\,N^{x}} \]

that such a mapping will be unique.
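
The product of the factors (N-i+1)/N described above, and its
closed form N!/((N-x)! N^x), can be compared numerically. The sketch
below is an illustration only; the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def unique_probability(x: int, n: int) -> Fraction:
    """Probability that x random choices among n outputs are all distinct."""
    if x > n:
        return Fraction(0)
    prob = Fraction(1)
    for i in range(1, x + 1):
        # The ith input finds n-i+1 of the n outputs still empty.
        prob *= Fraction(n - i + 1, n)
    return prob


def closed_form(x: int, n: int) -> Fraction:
    """The same probability written as n! / ((n-x)! n^x), for x <= n."""
    return Fraction(factorial(n), factorial(n - x) * n ** x)


if __name__ == "__main__":
    for x in (1, 2, 4, 8):
        print(x, unique_probability(x, 8), float(unique_probability(x, 8)))
```

The probability falls quickly as x approaches N, which is why a
fully loaded random assignment is almost never unique.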

The above formula is expected to be related to the probability of
obtaining z successes across N ports in a packed Omega network.
(This suspicion is based on the fact that there is one and only
one path for each input-output pair.) For small x, this function
is presumed to increase linearly, but for larger x, z seems to
increase more slowly than x. Qualitatively, unpacking the network
corresponds to increasing N, which increases E(z in N).

To find the expected number of successes in an equilibrium
condition, set E(z in N) equal to 1/2 and solve for z. However,
for a more exact and more complicated procedure for obtaining this
result, see Section I.7.

I.5 PERMUTATION GROUPS AND PARTITION SETS

Benes's proof of the fact that a network of 2N log2 N switches is
sufficient to ensure the rearrangeability of N inputs to N outputs
was published in 1964 [3]. This article draws heavily on group
theory and the concept of the partition of the set (1, 2, 3, ...,
N). The partition of a set is a finite collection of disjoint
sets whose union is the given set.

A. Consider storing the Benes transposition network of 2n-1
levels as a matrix. (Storing the n-level Omega network is a
special case of this.) On the first row, store the vector (0, 1,
2, ..., N-1). On the second row, store the vector (0, 2, 1, 3, 4,
6, ..., N-1), taking the first two even numbers, then the first two
odd numbers, then the next two even numbers, until all N elements
of the vector are stored. On the ith row, for i less than n,
store the first $2^{i-1}$ even elements, then the first $2^{i-1}$
odd elements, and alternate until all the elements of the set (0,
1, ..., N-1) are used up. For the nth, and middle, row, store the
$2^{n-1}$ even elements, then the $2^{n-1}$ odd elements. (The
first half of the N=16 Benes network is shown in Table I.5.) For
row i between n and 2n-1, store just as the (2n-i)th row. Now to
compute the path that a given address would follow in the absence
of other addresses, adopt the following procedure.
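
The row construction described in Part A can be sketched as follows.
This covers only the storage of the matrix, not the path procedure,
and it is an illustration under assumptions: the function name is
hypothetical, and the chunk size for row i is taken to be
2^(i-1), which matches the second-row example (0, 2, 1, 3, 4, 6,
...) given in the text.

```python
def benes_rows(n: int) -> list[list[int]]:
    """Rows 1 .. 2n-1 of the matrix for N = 2**n, as described in Part A."""
    size = 2 ** n
    evens = list(range(0, size, 2))
    odds = list(range(1, size, 2))

    def row(i: int) -> list[int]:
        # Alternate chunks of 2**(i-1) even elements and 2**(i-1) odd elements.
        chunk = 2 ** (i - 1)
        out: list[int] = []
        for start in range(0, size // 2, chunk):
            out += evens[start:start + chunk] + odds[start:start + chunk]
        return out

    rows = [row(i) for i in range(1, n + 1)]
    # Row i between n and 2n-1 is stored just as row 2n-i.
    rows += [rows[2 * n - i - 1] for i in range(n + 1, 2 * n)]
    return rows


if __name__ == "__main__":
    for r in benes_rows(4):
        print(r)
```

For N = 16 the sketch produces seven rows, the first being the
identity vector and the fourth (middle) row being all even elements
followed by all odd elements.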


Table I.4

Simulation of Skip Distances

BENES (F12)

Skip   Offset   Successes   Options   Comments

   2        3         251
   2        4         246
   3        4         512             Magic
   3        5         512             Magic
   5        6         413
   5      357         382
   8      357         139
 128      357         230
 128      357         266   BR
 256      357         233
 518      357         512             Magic
 520      357         512             Magic

OMEGA (F13)

Skip   Offset   Successes   Options   Comments

   -        -         199   R         Seed=0013
   -        -         211   BR & 2    Seed=0013
   1        0          32             Worst
   1        1          36
   2        0          63
   3        0          85
   4        0          78
  13      357         111
 128      357         279
 128      357         259   BR
 210      357         307
 260        0          79

Standard deviation of N Count


I deviate
is
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Experiments  on  the  CN  Simulator  confirm  this  hypothesis.  However, 
it is not intuitively clear why this is the case, nor is a strict
correspondence between offsets obvious. In part, the answer lies
in the fact that to each skip distance there corresponds a cyclic
ordered set of permutations on the outputs (0, 2, 4, 6, 8, 10, 12,
14, 1, 3, 5). This set is itself the correspondence set for skip
1. (For skip 5, the ordered set is (0, 10, 5, 8, 3, 6, 1, 4, 14, 2,
12). For skip 6, the ordered set is (0, 12, 2, 14, 4, 1, 6, 3, 8,
5, 10). These sets are the same, save that they are oppositely
ordered.)
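
The ordered sets quoted above can be regenerated with a short
sketch. The module-to-port rule used here is an assumption inferred
from the listed output set (0, 2, 4, ..., 14, 1, 3, 5): module m is
taken to sit on port 2m for m below 8, with the remaining three
modules on the odd ports 1, 3 and 5. Function names are
hypothetical.

```python
def port_of_module(module: int, processors: int = 8) -> int:
    """Assumed placement of the 11 modules on the 16 output ports."""
    if module < processors:
        return 2 * module
    return 2 * (module - processors) + 1


def ordered_ports(skip: int, offset: int = 0, modules: int = 11) -> list[int]:
    """Output ports visited by successive elements (offset + skip*i) mod 11."""
    return [port_of_module((offset + skip * i) % modules)
            for i in range(modules)]


if __name__ == "__main__":
    for skip in (1, 5, 6):
        print(skip, ordered_ports(skip))
```

Apart from the shared starting port 0, the list for skip 6 is the
list for skip 5 read backwards, which is the opposite ordering the
text describes for the skip 5 and skip 6 correspondence.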

A natural question which arises in this analysis is whether there
are any skip distances which are particularly bad. Of course, the
very worst case will be a skip of 521, which corresponds to a skip
of zero, in which case all the processors will attempt to access
the same memory module. Other than this, and this seems to be a
rather important fact, the greatest number of blocks occurs in an
8x1^  Omega  for  skip  distances  of  one,  especially  those  with 
small even offsets. Table 1.2 verifies this. Furthermore, for
all the trials which have been run on the simulator, skip = 1,
offset = 0 was the worst, with only 32 successes in 512 trials. The
reason for this is as follows: the second level of an unpacked Omega
will account for blocking of half the inputs if inputs from
adjacent nodes wish to access the same quadrant of the network.
Similarly, if adjacent nodes on the next level wish to access the
same octant of the network, half this number will be blocked.
This halving process, as the addresses are "funneled together",
continues until they are half-way through the network, at which
point they are "funneled back out" to their separate outputs. For
odd offsets, the funneling process does not begin until the third
level. For larger offsets, the mod 521 configuration of the
unpacked Omega tends to randomize the pattern.
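
The remark that a skip of 521 behaves like a skip of zero can be illustrated with a minimal sketch. It assumes that processor p requests address offset + p*skip and that the memory module is that address taken modulo 521. The function names and the figure of 512 processors are assumptions made for illustration. The sketch counts distinct modules only and says nothing about blocking inside the network itself.

```python
N_PROCESSORS = 512   # number of requesting processors (assumed)
N_MODULES = 521      # prime module count implied by the mod 521 mapping


def requested_modules(skip, offset=0, n_proc=N_PROCESSORS, n_mod=N_MODULES):
    """Module requested by each processor for a given skip distance and offset."""
    return [(offset + p * skip) % n_mod for p in range(n_proc)]


def distinct_modules(skip, offset=0):
    """How many different modules the processors ask for."""
    return len(set(requested_modules(skip, offset)))


for skip in (0, 1, 128, 521):
    print(skip, distinct_modules(skip))
```

Because 521 is prime, every skip that is not a multiple of 521 spreads the 512 requests over 512 different modules in this model, so the differences in success rate among such skips come from the network paths rather than from module conflicts.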

It is not yet clear just what the overall relation between
blockages and skip distances actually is; largely this problem is
irrelevant. It could be solved empirically by running, say, two
hundred simulations picked from the skip distance range (0,260)
for offsets (0,1) and plotting a curve. (Results from a few
selected simulations are offered in Table 1.4.) One would expect
some kind of periodicity. But in fact, every such experiment
which has been run for an Omega network has resulted in a success
rate less than that for random requests (although for Benes
networks skips 1 and 3 are "magic"), and significantly, it seems that
the bit reversal procedure used for mapping (see Appendix B for
details) is tantamount to a 'pseudo-randomization' of sorts. If
this randomization is hardwired, no skip distance should be
particularly bad.


1. Initialize the input node, column j.

2. If the control bit for the ith level calls for "go straight",
   then the node for level i+1 will be found on row i+1,
   column j.

   If the control bit calls for a "go across" and

   a. j is even, then the node for level i+1 will be found
      on row i+1, column j+1, or

   b. j is odd, then the node for level i+1 will be found
      on row i+1, column j-1.

3. Let j = node number and increase i.

4. If i is less than 2n, go to 2.

The value of the node number for each i describes the path taken
by the given address.
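
The four steps above can be sketched as a short routine. The layout of the stored table is an assumption: table[i][c] is taken to hold the node number stored at row i, column c, and control_bits[i] is 0 for "go straight" and 1 for "go across". The small table in the usage lines is made up for illustration and is not the N=16 table of Table 1.5.

```python
def trace_path(table, control_bits, start_col):
    """Follow one address through the stored network table.

    table[i][c]     -- node number stored at row i, column c (assumed layout)
    control_bits[i] -- 0 = "go straight", 1 = "go across" at level i
    Returns the node number seen at every level, starting with start_col.
    """
    j = start_col                      # step 1: initialize the input node
    path = [j]
    for i, bit in enumerate(control_bits):
        # step 2: straight keeps column j; across moves to j+1 (j even) or j-1 (j odd)
        col = (j + 1 if j % 2 == 0 else j - 1) if bit else j
        j = table[i + 1][col]          # step 3: j becomes the node number found there
        path.append(j)
    return path                        # step 4 is the loop bound on the levels


# Hypothetical 3-row, 4-column table used only to exercise the routine.
demo_table = [
    [0, 1, 2, 3],
    [0, 2, 1, 3],
    [0, 1, 2, 3],
]
print(trace_path(demo_table, [1, 1], 0))
```

The routine does not model contention; as in the text, it gives the path an address would take in the absence of other addresses.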

B. Suppose one wishes to know whether two given input-output pairs
u-w and v-x result in a block for an Omega network. Let U be the
set (0, 1, 2, ..., N-1). Let i be the level number for i=0, ...,
n. Partition U sequentially into $2^n/2^i$ matrices of size
2-by-$2^{i-1}$. Call these $I_1, \ldots, I_{2^n/2^i}$. These are
the input matrices. Also partition U into $2^n/2^{n-i}$ matrices of
size $2^{n-i}$-by-1. Call these $O_1, \ldots, O_{2^n/2^{n-i}}$.
These are the output matrices. Then for all (u, v, w, x), if
there exists a j such that u is an element of $I_j$ and v is an
element of $I_j$, and there exists a k such that w is an element of
$O_k$ and x is an element of $O_k$, then u-w blocks v-x by stage i.
Symbolically, this condition can be written:

$$\forall (u, v, w, x):\ \exists j\,(u \in I_j \wedge v \in I_j) \wedge \exists k\,(w \in O_k \wedge x \in O_k) \Rightarrow u\text{-}w \text{ blocks } v\text{-}x$$

The partition sets for n=4 are given in Figure 1.4. For example,
note that 4-8 blocks 7-11 since 4 and 7 are elements of $I_2$ and 8
and 11 are elements of $O_3$. So 4-8 blocks 7-11 by level 2.

Note also that i is not unique but is satisfied for any i greater
than some minimum i. To make i equal to this minimum i, require
that u and v be from different columns of $I_j$.

This can be proven by considering, for the ith level, the ith most
significant bit. If the ith most significant bit of the two
inputs to any switch are the same, i.e.

    XXXX...0YYY...          XXXX...1YYY...
                      or
    XXXX...0YYY...          XXXX...1YYY...

where X's and Y's represent bits that may assume any combination
of 1's and 0's, then there will be a block. Bits which are more
significant than i can occur in all possible combinations, but
these bits determine which inputs the addresses could have come
from. Thus a pairing between a set of output addresses and a set
of input addresses can be formed. While this is not a formal
proof, this result can be shown combinatorially by enumeration.

C. Suppose one wishes to know the mean probability that two
random addresses will result in a block. This is a function of a
relation between two inputs which is called their 'distance'.
Consider input 0. There is a 1/2 probability that it will be blocked
by input 1 since this blocking occurs on the first level. There
is a 1/4 probability that it will be blocked by inputs 2 or 3;
this would occur on the second level. For inputs 4, 5, 6 and 7
the probability is 1/8 since the level is the third. In
general, the distance for input 0 is the level on which the two
inputs could block, so call it i. Then the probability that the
inputs will block on level i is $(1/2)^i$.


Now assumedly there is a function g(x,y) of any two input numbers
(x,y) such that i=g(x,y). Then taking the average value of the
function $f(x,y) = (1/2)^{g(x,y)}$, the probability of a block is obtained.
But for this to be a truly random distribution, one must average
over both x and y as shown in Equation 1.3.

Now, for x=0 and y ≠ x:

    x        0  0  0  0  0  0  0  0  0  0   0   0   0   0   0
    y        1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
    g(x,y)   1  2  2  3  3  3  3  4  4  4   4   4   4   4   4

For x=0, there will be in general $2^{z-1}$ values of y with g(x,y)=z.

Now pick some other random value of x, say x=4.

    x        4  4  4  4  4  4  4  4  4  4   4   4   4   4   4
    y        0  1  2  3  5  6  7  8  9  10  11  12  13  14  15
    g(x,y)   3  3  3  3  1  2  2  4  4  4   4   4   4   4   4

so, again, there are $2^{z-1}$ values of y for each value z of
g(x,y). We thus have a basis for a change of variables,

$$\overline{f(z)} = \sum_{z} f(z)\,p(z) \qquad (1.4)$$

where z=g(x,y) and $p(z) = 2^{z-1}/(2^n-1)$ is the fraction of the
other inputs lying at distance z. Now f(z) just equals $(1/2)^z$; and
the limits of the summation are from 1 to n, where n is the number
of levels.

$$\overline{f(z)} = \sum_{z=1}^{n} \left(\frac{1}{2}\right)^{z} \frac{2^{z-1}}{2^n-1} \qquad (1.5)$$

and since

$$\left(\frac{1}{2}\right)^{z} 2^{z-1} = \frac{1}{2} \qquad (1.6)$$

then

$$\overline{f(z)} = \frac{n}{2(2^n-1)} \qquad (1.7)$$

which depends only on n. For n=9, $\overline{f(z)}$ = 9/1022.
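
The closed form can be checked by brute-force enumeration. In the sketch below the distance g(x,y) is computed as the bit length of x XOR y, which reproduces the two tables of g(x,y) given for x=0 and x=4. That formula and the function names are an assumption introduced here rather than something stated in the text.

```python
from fractions import Fraction


def distance(x, y):
    """Level g(x, y) at which inputs x and y could block (bit length of x XOR y)."""
    return (x ^ y).bit_length()


def mean_block_probability(n):
    """Average of (1/2)**g(x, y) over all ordered pairs of distinct inputs."""
    size = 1 << n
    total = Fraction(0)
    for x in range(size):
        for y in range(size):
            if x != y:
                total += Fraction(1, 2 ** distance(x, y))
    return total / (size * (size - 1))


print([distance(0, y) for y in range(1, 16)])
print(mean_block_probability(4), Fraction(4, 2 * (2 ** 4 - 1)))
```

Exact fractions are used so that the enumerated mean can be compared with n/(2(2^n - 1)) without rounding.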

1.6 TERM ANALYSIS FOR RANDOM BLOCKING

Consider a packed Omega. The blocking probability for two inputs
(i,j) with random addresses is given by $f(i,j) = (1/2)^{g(i,j)}$.

Thus a matrix can be made out of the f(i,j)'s. In general, the
f(i,j)'s will either be 1/2, 1/4, 1/8, 1/16, or 0. In particular,
if any input k has no request, then both f(k,j)=0 and f(i,k)=0.
Also, f(i,i)=0.

Here notation will be changed so that it is more in line with
symbolic logic and set theory. Let $\alpha_{ij} = f(i,j)$ and
not-$\alpha_{ij} = 1 - f(i,j)$. Now consider the prospects for adding
one more input to the net. Inputs can be added from left to right to
see when it becomes probable that the new input is blocked. The
probability that input 0 is blocked by 0, $\alpha_{00}$, is zero, of
course. The first term will be the probability that 1 is blocked by 0
(i.e., $\alpha_{10}$, which equals 1/2 for a packed and 0 for an
unpacked Omega). The second term will be the probability that 2 is
blocked by 0, but 1 is not, $P(\alpha_{20} \wedge \neg\alpha_{10})$,
which is 1/4 x 1/2 for a packed Omega. The third term will be the
probability that 1 blocks 2 given not-$\alpha_{20}$ and
not-$\alpha_{10}$, i.e.,

$$P(\alpha_{21} \wedge \neg\alpha_{20} \wedge \neg\alpha_{10}).$$

The next term will be

$$P(\alpha_{30} \wedge \neg\alpha_{21} \wedge \neg\alpha_{20} \wedge \neg\alpha_{10})$$

and the term after that is

$$P(\alpha_{31} \wedge \neg\alpha_{30} \wedge \neg\alpha_{21} \wedge \neg\alpha_{20} \wedge \neg\alpha_{10}).$$

In general, the kth term will be the product of k such atomic
units, of which k-1 are negated. In the iterative procedure, one
would have a 'tail' to the end of which the negated form of the
last atomic unit is multiplied before multiplying by the new
atomic unit obtained from the matrix. The kth term is then summed
to the present value of the first k-1 terms.
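
The iterative 'tail' procedure can be sketched as follows for a packed Omega. The sketch takes the atomic units in the order described above, (1,0), (2,0), (2,1), (3,0), ..., and takes the pairwise probability to be (1/2) raised to the bit length of i XOR j. Both the ordering of later terms and that formula are assumptions made for illustration.

```python
from fractions import Fraction


def alpha(i, j):
    """Pairwise blocking probability for a packed Omega (assumed form)."""
    if i == j:
        return Fraction(0)
    return Fraction(1, 2 ** (i ^ j).bit_length())


def blocking_terms(n_inputs):
    """Yield ((i, j), term): the new atomic unit times the tail of negated earlier units."""
    tail = Fraction(1)
    for i in range(1, n_inputs):
        for j in range(i):
            a = alpha(i, j)
            yield (i, j), tail * a
            tail *= 1 - a              # append the negated unit to the tail


def terms_until_half(n_inputs):
    """Sum terms until the running total exceeds 1/2, as in the text."""
    total = Fraction(0)
    used = []
    for pair, term in blocking_terms(n_inputs):
        total += term
        used.append((pair, term))
        if total > Fraction(1, 2):
            break
    return used, total


print(terms_until_half(4))
```

For the packed case the first two terms are 1/2 and 1/4 x 1/2, matching the values quoted in the text, so the running sum passes 1/2 almost immediately.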
When in the course of this procedure the current value of this
summation becomes greater than 1/2, the procedure may be
abandoned. The fact that the probability rises above 1/2 means that
it is expected that this new element will be blocked. A new
procedure is now adopted, in an attempt to find the probability
that any two elements are blocked. The first two terms are zero.
The third group of terms is [expression illegible]. The fourth group
of terms will be [expression illegible]. The fifth group will
be [expression illegible]. In general, for the kth term group, there
will be k-1 subgroups, each composed of k products. The k
products will be given by the permuting of one additional
affirmative over the smallest k-1 matrix elements, on one side of
the diagonal. The smallest elements are defined by the fact that
the left subscript must be less than or equal to that of the new,
and the right subscript must be less than it. If the sum of
these terms at any time is greater than 1/2, this procedure, too,
is terminated and a procedure which tests for three blocks is
implemented.

In general, in a procedure looking for i blocks, the kth group of
terms will have [expression illegible] subterms, each of which is a
permutation of i-1 affirmative α's over k-1 α's. When i is large
enough so that the whole network is done while the sum is less
than 1/2, then this is just the expected blocking rate. Since
there are on the order of $0.5N^2$ groups of terms (for half the
atomic coefficients in the square array) and [expression illegible]
subterms in each group, then there are at [expression illegible]
operations in this procedure. For large problems there are many
blocks, and i may be on the order of 100, making the computation
even more prohibitive than that suggested in Section 1.4.

However, there may be a "coarser" way to estimate the network
blocking. We have noted that there are k atomic elements in each
of the [expression illegible] subgroups. The minimum number of elements
in any subgroup is i, for i blockages. For the kth group, each of
these subgroups will be composed of i affirmative α's and k-i
negative α's. Now the average value of one of these α's is, as
shown in Section 1.5, $n/(2(2^n-1))$ or $\log_2 N/(2(N-1))$. Similarly, one
could show that the average value of the function $1-(1/2)^{g(x,y)}$ is
$1-\log_2 N/(2(N-1))$. (Assume that this average value of each of the
terms found in this way will be a good estimator for the product.
Basically this average says that the typical block will occur at
the $\log_2(2(N-1)/n)$th level. For n=9, this is about 6. In this
way, the computation can be drastically reduced.) Each of these
groups can be written as a product of one of these estimator terms
times the number of such terms. And since there are such groups
for all k from i to N, we are left with the sum
which depends only on i and N. The largest such term occurs for
k=i, and is just $(1-\log_2 N/(2(N-1)))^i$. (Since B(i in N) is greater
than its first term, a good way to make a lower limit
approximation for i is to use the least i such that this term is
less than 1/2. For N=512, this is just $(1013/1022)^i$.) The last
and smallest term in the series, for k=N, which we call $B_{min}$(i in
N), may be written


$B_{min}$(i in N) = [expression illegible]


Note the resemblance between the part in brackets and the
formula in Section 1.4, with N-i corresponding to x. One major
difference between the two is that the 'i' in the former is the
number of blocks, while the 'x' in the latter is the number of
successes.
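
The lower-limit approximation described in the parenthetical remark, the least i for which $(1-\log_2 N/(2(N-1)))^i$ falls below 1/2, is straightforward to evaluate. The sketch below uses exact fractions and assumes N is a power of two. The function name is illustrative.

```python
from fractions import Fraction


def least_i_below_half(n_ports):
    """Least i with (1 - log2(N) / (2 (N - 1)))**i < 1/2, for N a power of two."""
    n = n_ports.bit_length() - 1                   # log2 N
    base = 1 - Fraction(n, 2 * (n_ports - 1))      # 1013/1022 when N = 512
    i, term = 1, base
    while term >= Fraction(1, 2):
        i += 1
        term *= base
    return i


print(least_i_below_half(512))
```

This evaluates only the first-term bound. It is not the full sum B(i in N), whose expression did not survive in the text.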

1.7 STATE OF THE CONNECTION NETWORK

One of the networks proposed is the two-layer unpacked Omega with
bit reversal and alternating priorities between layers and cycles.
In fact, it is suspected that a hardwired processor-to-input
randomization would work as well as a bit reversal. Any priority
rule that favors the left port will favor addresses going to the
left side of the network, and vice versa. However, a random
priority rule, where the priority is determined by a random number
from 1 to 4 (favoring left, right, straight-through, and crossed)
would probably be optimal. One way to improve the priority rules
is to add a bit to the address which says: "I am a success so
far." Then if there is a conflict, and if one or the other of the
addresses has, say, a 1 in this place, then the switch will give
that address the priority.

The Benes network now appears suboptimal. In the absence of
overall control, an algorithm must be developed which produces an
address from the first half of the network. The algorithm studied
obtained the address through an "exclusive or" on the processor
and memory module numbers. This algorithm has a serious flaw in
it for any unpacked Benes. As long as the ports are unpacked,
both processor and EM numbers are even, except for the nine
odd-valued memory module numbers. The fact that the least
significant bit is zero implies that the addresses go straight
through on the first level; this in turn implies that at the
middle level of the network all the addresses are in the left half
of the switches. Thus, at the middle layer, the addresses are
're-packed'. Also, at every level prior to the middle one, only
half the switches are used. Needless to say, this seriously
degrades the simulation of any unpacked Benes. However, this
problem should be rectifiable with a bit reversal.
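
The flaw in the exclusive-or address is easy to demonstrate: when both port numbers are even, the least significant bit of their exclusive or is always zero. The sketch below uses a small range of port numbers purely for illustration. The function names are assumptions, and the routine is not the simulator's actual address generator.

```python
def first_half_address(processor_port, module_port):
    """Address for the first half of the Benes: XOR of the two port numbers."""
    return processor_port ^ module_port


def lsb_always_zero(ports):
    """True if every processor/module pairing yields an address with LSB 0."""
    return all((first_half_address(p, m) & 1) == 0 for p in ports for m in ports)


unpacked_ports = range(0, 64, 2)   # even-numbered ports only (illustrative size)
packed_ports = range(0, 8)         # every port in use

print(lsb_always_zero(unpacked_ports))
print(lsb_always_zero(packed_ports))
```

With unpacked (even-only) ports the control bit for the first level never varies, which is the re-packing effect described above. Once odd port numbers are present the bit is free to take either value.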


Part  of  this  analysis  considered  various  alternative  network 
organizations  in  case  additional  throughput  is  required.  One  way 
to improve performance is to 'double-unpack' the inputs, so that
there  is  only  one  address  for  every  four  ports.  This  implies,  as 
well,  adding  another  level  to  the  network.  To  compare  the  effect 
of  doubling  the  number  of  ports  with  that  of  adding  another  layer, 
a packed  two-layer  Omega  and  a packed  one-layer  Omega  with  256 
requests  were  simulated.  If  the  simulation  is  to  scale  properly, 
one  would  expect  to  find  half  as  many  successes  with  256  inputs  to 
512  ports  as  with  512  inputs  to  1024  ports.  In  simulation,  the 
packed  two-layered  Omega  with  256  inputs  produced  70  successes. 
Thus  in  performance  a double-unpacking  appears  equivalent  to 
doubling  the  number  of  layers.  This  further  suggests  that  for  the 
first  cycle  the  success  rate  depends  only  on  the  number  of  input 
ports  for  an  Omega  network. 

Still, probably the best way to reduce blockages experienced in the
networks studied is to add more layers. As was previously noted,
it is deceptive to think in terms of "layers". Any n-layered
network of 2 x 2 switches can actually be represented as one layer
of 2n x 2n switches. When Benes proved his theorem, he did so for
any Benes network of square switches, i.e., n x n. Now for an
Omega Network, the address generates the path at the switch by a
simple procedure of left-right responses. And while it may be
difficult to construct a local control procedure for binary
addresses or odd-valued n x n switches, equivalences can be set up
between n inputs and n outputs for a 2n x 2n switch. The resulting
logic at each switch would be more complicated in this case.
APPENDIX J

DESIGN GUIDELINES FOR NASF PROCESSING SYSTEM

J.1 SCOPE

This document delineates general guidelines that may be used in
the design, fabrication and assembly of the hardware required for
the Numerical Aerodynamic Simulation Facility (NASF) Processing
System.

J.2 DESIGN CONSTRAINTS

J.2.1 Environmental

The environmental limits specified represent the conditions
normally found in most laboratory or office buildings deemed suitable
for professional employees and assume that air conditioning and
other controls have been provided to attain these levels. It is
incumbent on the design of the PMP hardware not to adversely
affect this environment.

J.2.1.1 Atmospheric Conditions

Table J.1 defines the limits of temperature, humidity, and
altitude for operating, non-operating, storage and shipping
conditions.

Dust levels may exist to the extent resulting from a filtered air
conditioning system meeting NBS blackness test with a minimum
rating of 50% efficiency using atmospheric dust.

J.2.1.2 Mechanical Stress

Table J.2 delineates the mechanical stress levels for the
equipment installed (operating and non-operating) and in shipping
containers.

Shock is defined as a non-periodic mechanical pulse of large
amplitude about a fixed point.

Vibration is a steady state periodic or random oscillation which
may have a sinusoidal or a complex waveform and may have a single
frequency or broad spectrum.

J.2.1.3 Acoustic Noise

The equipment should not be affected by exposure to sound
pressures of 130 dB* (C) for a period of 30 minutes.

* Ref. 2 x 10^-4 dynes/cm2
TABLE J.1 ATMOSPHERIC CONDITIONS

                      OPERATING        NON-OPERATING     SHIPPING AND
                      (Installed)      (Installed)       STORAGE

DRY BULB TEMP.        18°C to 30°C     -40°C to 50°C     -40°C to 70°C

WET BULB TEMP.        NA               30°C Max.         40°C Max.

RELATIVE HUMIDITY     40% to 60%       90% Max.          95% Max.

ALTITUDE              0-3 km           0-3 km            0-15 km


TABLE J.2 MECHANICAL STRESS

                      Operating and        Shipping and
                      Non-Operating        Storage
                      (Installed)          (In shipping container)

SHOCK

Peak Acceleration     .5 g                 5 g

Duration              .1 to 1 sec.         5 to 50 millisec.

Waveshape             1/2 sine             1/2 sine

Force Application     Horizontal           3 Orthogonal axes


VIBRATION

Frequency Range       5 to 500 Hz          5 to 500 Hz

Peak Acceleration     .1 g                 1.5 g

Force Application     3 Orthogonal axes    3 Orthogonal axes

J.2.1.4 Radiation

The equipment should not be affected by radiation of the following
intensities:

(1) Stray magnetic fields    .0005 tesla

(2) External RFI             1.0 Volt/Meter, 500 kHz to 10 GHz

J.2.1.5 Static Electricity

Externally exposed hardware should be immune to static electric
discharges of up to 10 kilovolts from 500 pF through 50 ohms.

J.2.1.6 Fungus

Fungus inert parts and materials should be used to the greatest
extent possible. Parts or materials not inert to fungus growth
should be treated with fungicidal material. No damage to parts or
material should result from treatment with fungicidal material or
fungicidal coating.

J.2.2 Electromagnetic Interference Control

The NASF equipment should be compatible with: a) other electronic
devices operating in the immediate area, and b) communications
services. Control of the electromagnetic emanations from the NASF
equipment must be an integral part of overall system design.

Based on the nature of the NASF design and mission, conducted and
radiated emanations shall comply with the limits illustrated on
Figures J.1 and J.2 respectively.

J.2.3 Acoustic Noise Control

Personnel should be provided an acoustical environment which will
not interfere with, or in any way degrade, overall NASF
effectiveness. To ensure compliance with this requirement, acoustic noise
levels of the NASF Processing System should not exceed the
following criteria:

EQUIPMENT AREAS                  75 dB (A) *
                                 68 dB SIL **

OPERATOR AREAS                   65 dB (A)
                                 58 dB SIL

I/O AREAS (ELECTRO-              80 dB (A)
MECHANICAL DEVICES)              71 dB SIL

*  dB(A): Measurement using (A) weighting network on Sound Level
   Meter.

** dB SIL: Speech Interference Level. The arithmetic average of
   the sound-pressure levels in the octave bands centered
   on 500, 1000, and 2000 Hz.
[Figure J.1: two plots of conducted emission limit against frequency
in MHz, frequency axis marked at .45, 1.6 and 30.
a) Narrowband Limit - Average Detector* (level axis marked at 80 and 50)
b) Broadband Limit - Quasipeak Detector** (level axis marked at 100 and 70)]

* AVERAGE DETECTOR: A detector, the output voltage of which
approximates the time average value of the envelope of an applied
signal. Refer to ANSI C63.2-197X.

** QUASIPEAK DETECTOR: A detector having specified electrical
time constants, which, when regularly repeated pulses of constant
amplitude are applied to it, delivers an output voltage which is a
fraction of the peak value of the pulses, the fraction increasing
towards unity as the pulse repetition rate is increased. Refer to
CISPR Publications 1, 2, and 4.

Figure J.1 Conducted Limits

[Figure J.2: plot of radiated emission limit (dB above 1 µV/m) against
frequency in MHz, showing two curves: Broadband Limit - Quasipeak
Detector and Narrowband Limit - Quasipeak Detector]

Figure J.2 Radiated Limits (30 Meters)
J.2.4 Input Power

Tables J.3 and J.4 delineate the minimum quality level of the power
that should be available for the NASF Processing System. The
Processing System should have its own power control and
distribution subsystem that will operate from this input power and
supply the appropriate power to the various hardware elements of
the processing system.

J.2.5 Design and Construction

Unless otherwise specified, the NASF Hardware should be designed
in accordance with good commercial practices.

J.2.5.1 Physical Characteristics

J.2.5.1.1 Cabinets - Removable panels and doors should be
utilized to enclose the structure.

J.2.5.1.2 Size and Weight - No single unit, cabinet or component
should exceed 3,600 pounds or exceed the following dimensions:

Height 72"

Width 84"

Depth 35"

The floor loading should be no more than 250 lbs/ft2 for fully
operable equipment.

J.2.5.1.3 Marking

J.2.5.1.3.1 Marking of Equipment - Each major assembly should be
permanently and legibly marked with the manufacturer's
identification (name, initials, trademark, code number, or symbol), serial
number, and model number. Permission shall be granted to the
manufacturer to place its name/symbol on the front of the
equipment.

J.2.5.1.3.2 Marking of Controls - Controls related to the
operation or conditioning of the equipment, either remotely or
locally, should be clearly identified.

J.2.5.1.3.3 Marking of Subassemblies - All removable and
repairable plug-in subassemblies should be identified and marked with a
serial number. Labels should be positioned so they can be readily
seen.
Table J.3

Power Source Description

Total Power:                   750 kVA
Frequency:                     60 Hz
  Tolerance:                   ±3%
  Rate of Change:              1.5 Hz/Sec Max
Voltage:                       460, 3 Phase, 3 Wire plus ground
  Range of slow-averaged       +10%, -15%
  rms voltage (including
  brown outs)
  Imbalance                    5% Max
  Modulation                   1% Max
  Harmonics (total)            20% Max
  Max Any Harmonic             10% Max
  Deviation Factor             25% Max
  D.C. Component               1% Max

Table J.4

Power Source Transients, Recovery and Capability

*Power Source Transients (on any or all phases)

1/2 Cycle or Longer

  Maximum Transient Surge
    Level:   To 130% of nominal rms voltage, recovering to 120%
             in 50 ms or less, then within 110% in 3 sec. or less
    Remarks: *Voltage deviations shall be within the limits shown
             when the utilization voltage is within its tolerance
             limits (+10%, -15%)

  Maximum Transient Sag
    Level:   To 50% of nominal rms voltage, recovering to 70%
             in 100 ms or less, then to 85% or more in 0.5 sec. or
             less

Less Than 1/2 Cycle But More Than 100 Microseconds

  Maximum Transient surge peak volts
    Level:   150% of nominal peak voltage (212% of nominal rms
             voltage) provided that volt-second limit is not exceeded
    Remarks: Surge component only, would be 250% total if occurring
             at wave peak

  Maximum Overvoltage Transient volt-seconds
    Level:   150% of nominal volt-seconds provided surge voltage
             limit above is not exceeded
    Remarks: Composite wave form

  Maximum Transient sag peak
    Level:   To zero volts

  Maximum Undervoltage Transient volt-seconds
    Level:   To zero for 1/2 cycle

Impulses (either polarity, 100 µsecs. or less), RFI Bursts

  Maximum Voltage Deviation of RFI, 10 kHz or greater
    Level:   400% of nominal peak voltage (566% of nominal rms
             voltage)
    Remarks: Impulse component only, would be 500% if occurring at
             wave peak

Event Rate
    Level:   Maximum of 10 in 10 minutes and at least 6 seconds
             between maximum limit events and full recovery to
             specified range of rms voltage between events
    Remarks: Starting condition for all events shall be within
             specified range of rms voltage and may be at worst case
             to maximize the event.

Power Source Capability

  Peak Inrush Limit
    Level:   4 kVA or 3 to 8 x rated kVA of load

  Load Imbalance
    Level:   25% max or 10 kVA, whichever is greater

  Source Impedance
    Level:   0.5% to 5% of Rated "Base" ohms at the power frequency

  Ground Return Impedance
    Level:   Impedance low enough to create ground fault current of
             30 x breaker trip rating for ground fault
J.2.5.1.4 Accessibility - All adjustments and any other work
required after the system is assembled should be readily
accessible for servicing. Special tools or other mechanical devices
required to readily and accurately adjust the equipment should be
supplied as part of the unit. The equipment should be designed to
protect operating and maintenance personnel from contact with
hazardous devices. The equipment may, however, be designed to
operate with the frame covers removed (but not necessarily meet
acoustic and EMI requirements under these conditions).

J.2.5.1.5 Grounding - Circuit grounds should be isolated from
chassis grounds but may be connected to the chassis if required.
Ground connections to chassis should be mechanically secured by
soldered terminals and locked by means of a lockwasher and nut. A
chassis ground tiepoint should be provided at the power interface
of individual cabinets and major elements of the system.

J.2.5.1.6 Mechanical Operation - All controls should operate
freely and smoothly without binding, scraping, or cutting. Play
and backlash should be minimized and should not cause poor contact
or inaccurate setting.

J.2.5.1.7 Transportability - The elements of the NASF Computing
System should be transportable by qualified domestic common
carrier without damage or deterioration when packaged, preserved, and
prepared for shipment.

J.2.5.2 Materials, Processes, and Parts

J.2.5.2.1 Parts Selection - The following principles should be
utilized in parts selection:

(a) The variety and types of parts required should be kept
    to a minimum.

(b) Common and regularly stocked parts should be used
    whenever feasible to simplify maintenance, storage and
    supply.

(c) The use of proprietary components should be avoided where
    practicable.

J.2.5.3 Workmanship - The equipment should be processed in such a
manner as to be uniform in quality and free from defects that will
adversely affect life, serviceability, and appearance. All metal
surfaces should be clean and free from burrs, roughness, oxide,
scales, and sharp edges. Printed circuit boards should be free
from cold soldering, corrosion, salts, smut, grease, finger
prints, flux residue, and foreign materials.
Product Safety


The NASF equipment should be designed and constructed so that in
normal operation and maintenance the equipment will function
reliably without causing injuries to persons or damage to
property, considering possible careless use that may occur in normal
service. Specifically, the equipment should be designed to comply
with the requirements of Underwriters Laboratories (UL) Standard
for Safety, Data Processing Units and Systems, UL 478.

3.2.1  Service  Life 

The  intended  life  of  the  system  as  a system  is  ten  years.  Service 
life  is  the  anticipated  life  of  the  system,  as  a system,  without 
reference  to  the  anticipated  useful  life  of  the  parts  of  the 
system  and,  therefore,  assumes  that  necessary  maintenance  and 
repairs  will  be  performed  as  required. 